Method of reducing redundancy between two or more datasets

ABSTRACT

A method for reducing redundancy between two or more datasets of potentially very large size. The method improves upon current technology by oversubscribing the data structure that represents a digest of data blocks and by using positional information about matching data, so that very large datasets can be analyzed and their redundancies removed: having found a match on a digest, the method expands the match in both directions in order to detect large runs of duplicate data and replaces duplicate runs with references to common data. The method is particularly useful for capturing the states of images of a hard disk. The method permits several files to have their redundancy removed and the files to later be reconstituted. The method is appropriate for use on a WORM device. The method can also make use of L2 cache to improve performance.

RELATED APPLICATIONS

Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet, or any correction thereto, are hereby incorporated by reference into this application under 37 CFR 1.57.

FIELD OF THE INVENTION

This invention relates generally to a system and method for storing data. More particularly, this invention relates to a form of data size reduction which involves finding redundancies in and between large data sets (files) and eliminating these redundancies in order to conserve repository memory (generally, disk space).

BACKGROUND OF THE INVENTION

This invention relates generally to a system and method for storing data. More particularly, this invention relates to storing data efficiently in both the time and space domains by removing redundant data between two or more data sets.

The inventors of this invention noticed that there are many times when very large datasets are created by users in which there is a great deal of commonality between and among the datasets. In some cases, for instance, there may be more than 99% of common data between these datasets, and these datasets may share long runs of identical data although the data may be in different locations in the datasets and/or there may be relatively small insertions and deletions of data. Some non-exhaustive examples of when such commonalities occur are described below.

The method of this invention attempts to find the common “long runs of data” and use this information to remove this redundancy so as to save repository space.

As an example of when commonalities occur, datasets on the order of tens of gigabytes or larger are commonly created as the products of modern backup operations on computers. Maintaining multiple versions of such datasets consumes a great deal of disk space, so that only a relatively few versions, if any, are generally kept. However, since the contents of each version of a backup for the same machine have a great deal in common with the contents of the other versions, it should be possible to reduce the total storage required significantly. In many cases this could allow the retention of hundreds or thousands or more of versions in the space now occupied by a dozen or so.

In a typical single-computer environment, the computer has an active Repository that contains information of importance to the owner of the computer. In order to preserve this information (data) from accidental loss or malicious deletion, one or more backups of the information are stored in one or more backup Repositories; such backups are kept in a dataset that is reified in one or more computer files.

Modern computers commonly use one of three types of Repositories: rotating-media-based, tape-based, and, less commonly (as we write this in 2009), solid-state-memory-based. This Method applies to disk-based and solid-state-memory-based Repositories although, theoretically, this method can apply to tape-based backups; as those familiar with the art understand, a Turing machine, using only sequential storage, can emulate a disk-based machine.

Users of this Method will find this method particularly efficacious when used with Repositories with the hardware property of “read many, write many” or “read many, write few” or “read many, write once”.

Modern disk drives are almost always of the type “read many, write many”. Modern disk drives also have the undesirable property that random seeks take milliseconds of time.

Modern solid-state memory has the undesirable properties of being far more expensive per byte than disk-based memory and (with some types of solid-state memory) of allowing only a limited number of rewrites before the device fails to be rewriteable. It has the desirable property that random seeks in the device's memory may be computationally almost cost free in the time domain.

We will teach how our Method can be adjusted to exploit both disk-based memory and solid-state memory, taking advantage of each kind of memory's particular performance characteristics.

One of the many ways to perform a backup is to take a so-called “Image Backup” of the Repository. In an Image Backup, the bit-pattern of the data in the Repository is stored somewhere so that—should it be necessary—the exact bit pattern of the original Repository can be recreated. This bit pattern is sometimes referred to as a Forensic Backup. A full Forensic Backup is independent of any operating system since all that is recorded is the bit pattern in the Repository. A full Forensic Backup need not “understand” the contents of the Repository.

Very often, the program that takes the Image Backup also knows which areas of the Repository have usable (that is, allocated) data and which areas are free to be used by an operating system to store new data.

A user's Repository is often broken up into logically contiguous areas known as partitions. Generally, during routine computer operation, only one operating system has access to any particular partition.

Operating systems are generally responsible for allocating space inside of a partition. Although an operating system might allocate space with variable length bit or byte allocations, generally, modern operating systems break up a partition into thousands to billions of fixed size pieces often called sectors. The industry has settled on a typical sector size of 512 bytes. A group of one or more logically contiguous sectors is often called a cluster. The operating system often has a bit map of one-bit-per-cluster indicating which clusters have been allocated to users by the operating system and which are free to be allocated; in some Microsoft systems this bit map is called the File Allocation Table (FAT). Files that have been deleted will often have the corresponding bits associated with the file changed in the allocation table from allocated to free.

The method we teach is not sensitive to the mechanism used for tracking allocated and free space.

The internal data structures that determine which data is allocated or unallocated are operating system dependent. Thus the program that is doing the backup must be aware of the particular operating system's so-called allocation map (data structure, e.g. FAT) if it is not to copy unallocated space to the backup repository. Backup programs normally depend on the dual facts that there are only a handful of commonly used operating systems and that the operating systems leave clues at the beginning of disk partitions or in a so-called Partition Table as to what internal data structure (e.g. FAT, FAT32, NTFS, HPFS, etc.) a particular partition is formatted for.

Unallocated space on a computer's disk generally has a (more-or-less-random) bit pattern, but that bit pattern is generally of little use to the owner of the computer's disk. Unallocated data might, for instance, be space associated with page files or deleted files. The backup program could (and often does) optionally only back up the allocated data. Nonetheless, this is still considered an Image Backup of the data in the Repository.

In the alternative, backup programs are written in such a manner that they ask for the assistance of the operating system to fetch files from the Repository and store those files in one or more backup datasets.

Modern computers typically use disk drives as their main Repository.

Because computer hardware and software fail and, in particular, because disk drives fail, it is the wise custom to take and keep several backups of the Repository. A typical modern (circa 2009) business or home user's computer has disk drives with storage on the order of magnitude of 500 gigabytes, of which about half is used. These are typical figures but the actual values vary greatly.

Assuming the typical values, above, an image backup will consume about 250 gigabytes if the data is uncompressed. The size of the backup will likely be reduced roughly another thirty percent if the backup program compresses the data. The aforementioned thirty percent will vary depending on the user's underlying data as well as the compression algorithm. Compression above fifty percent for typical allocated data is unusually high. At a backup speed of approximately thirty megabytes a second, an Allocated Image backup of the Repository will take about an hour. The backup speed will vary depending on many factors including whether the user is backing up the disk to another disk, the speed of the disk, the speed of the processor in its ability to compress raw data, etc.

Because users of computers accidentally delete files or their computers become infected by computer viruses, users often wish to retain multiple backups over time. Sometimes these same users wish to maintain multiple backups using proprietary formats from multiple backup software vendors.

At current 2009 prices, 1000 gigabytes of disk storage costs about $100. Thus, an unsophisticated backup scheme of uncompressed 250 gigabytes (125 gigabytes compressed) would cost the user about $12 for each backup, quickly limiting the number of backups the user maintains. This invention will allow the user to maintain many versions of the backups by eliminating redundancies between and among the backups.

As those familiar with the art understand, limiting factors in removing redundancy include but are not limited to (a) random access speed, and (b) the amount of RAM available to maintain tables representing the data for which redundancy is to be eliminated.

IV.1 Other Applications of the Method

While the previous discussion focused on image-based backups, this Method is not limited to that scenario.

There are many other times in which two large files will have a great deal of the kind of commonality described, above.

For instance, it is quite likely that, say, file-oriented backups of a customer database will have a great deal of commonality across time. There tend to be many customers who have no activity between backups, and those customers who do have activity generally have a small number of transactions to be recorded.

This Method applies to such scenarios and datasets, as well.

SUMMARY OF THE INVENTION

V.(1) Efficacy of the Invention

This invention relates generally to a system and method for storing data. More particularly, this invention relates to a form of data size reduction that involves finding redundancies in and between large data sets (files) and eliminating these redundancies in order to conserve repository memory (generally, disk space).

As currently implemented prior to this invention, the random access memory (RAM) requirements to remove redundancies in, say, a pair of multi-terabyte files exceed the capacity of most modern home computers and de-duplicating appliances. This Method vastly reduces the RAM requirements to remove much of the duplicate data between these two files as well as removing duplicate data within the files themselves.

V.(2) Definition of Local Information

To make our teaching in this Summary of the Invention somewhat easier, we define “Local Information” to mean a range of a large file that is significantly smaller than the size of said file.

Thus, if we are positioned at byte 5,000,000,000,000 in a seven-terabyte file, the fuzzy notion of Local Information might be several kilobytes, or several hundred megabytes, or several gigabytes of data on either side of said 5,000,000,000,000 position.

V.(3) Kinds of Compression and Data Deduplication

With some kinds of data, such as still pictures, audio, and video, data size reduction algorithms can take advantage of the fact that the human eye and ear have limited ability to distinguish certain features of visual or sonic data. For this reason, these types of data can be reduced in size in a lossy way; that is, the original data cannot be recovered exactly bit for bit, but the consumer of the data finds the alterations acceptable.

However, when dealing with other types of data, such as data processing documents, spreadsheets, or backup datasets, it is vital that any method of size reduction applied to these data be completely and perfectly reversible to recover the exact original information. Thus, we must use lossless data size reduction methods with such data.

There are currently two main categories of lossless size reduction: compression and deduplication.

Lossless compression—as distinct from deduplication—could be defined as a method that uses Local Information to determine how to represent a given subset of data in a more efficient manner. Examples of this include Huffman encoding and LZW coding, both of which have been described in detail in public documents.

Deduplication could be defined as a method that uses non-local information to determine how to avoid storing the same content twice. For example, if the data for an email program is hosted on a server, and a particular email message is sent out to many people on the same server, the server can maintain one copy of that email message rather than keeping one copy for each recipient, as the message is known to be the same for each recipient. This particular kind of deduplication is known as single instance storage, and is limited in scope because very few applications exist where entire files (in this case, email messages) are identical and can thus be stored once; furthermore, this is easier to do in the email case because the server is aware of the fact that the files are identical without having to examine them. More general deduplication approaches search for identical fragments of one or more files and represent those fragments by storing only one copy of the original fragment and using references to that original fragment to replace the other copies of that fragment.
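By way of illustration only, the following Python sketch shows the single-instance-storage idea in miniature; the class name and the choice of SHA-256 as the digest are illustrative assumptions, not part of this specification.

    import hashlib

    class SingleInstanceStore:
        """Toy content-addressed store: identical payloads are stored only once."""

        def __init__(self):
            self._blobs = {}  # digest -> payload

        def put(self, payload: bytes) -> str:
            key = hashlib.sha256(payload).hexdigest()
            self._blobs.setdefault(key, payload)  # a second identical put stores nothing new
            return key

        def get(self, key: str) -> bytes:
            return self._blobs[key]

    store = SingleInstanceStore()
    k1 = store.put(b"an email message sent to many recipients")
    k2 = store.put(b"an email message sent to many recipients")
    assert k1 == k2 and len(store._blobs) == 1  # one stored copy, two references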

There is no reason not to combine compression and deduplication to further reduce the amount of storage needed to represent a given dataset, and indeed this is commonly done.

This Invention is a method of data deduplication that reduces the amount of main storage (RAM) required to deduplicate large files by a significant ratio. This ratio can be in excess of 30:1 over the more conventional approach.

V.(4) Setting Up the Conditions for the Example we Teach

Assume that we have two files:

(FileX) A roughly one-terabyte file representing a year-to-October-1st email backup

(FileY) A slightly larger one-terabyte file representing a year-to-October-2nd email backup.

Clearly FileY is somewhat larger than FileX because it includes the emails that came in one day after the FileX backup was taken.

For convenience, we call FileX the Reference File and FileY the Current File.

V.(5) Deduplication Using the Current Art. Processing the Reference File and Creating the Hash Table

The following is an oversimplification of the current art. We do so to teach the current art in order to compare it to the novelty of the Inventors' new Method.

Prior to this Invention, those familiar with the current art would do the following: FileX (the Reference File) would be broken up into fixed-sized blocks. We will use 512-byte blocks for our example, but it is common to use larger blocks, such as 8192-byte blocks, in the current art so as to reduce the memory requirements for the hash table, although this will reduce the granularity of the deduplication (as discussed below, the larger the block size, the more likely that a run will not be detected).

For each 512-byte block, a hash (hash is a synonym for “digest”) is created. The hash is almost always a small fraction of the size of the block: if the hash is computed using the well-known CRC64 algorithm, the size of the hash would be 8 bytes. That is, the 512-byte block is represented in a field in some table, typically a hash table, by a much more manageable 8-byte quantity.

For convenience in teaching the art, we assume that all implementers of the current art use hash tables.

Each entry in the hash table will contain the hash value and a block number (or, equivalently, the byte position) indicating which block in the current file is represented by that particular hash (i.e. digest) value. There may be other items in each entry in the hash table, but for purposes of this summary, we assume that each entry in the table only consists of the hash value and the block number in the current file.

If

1. the algorithm used to create hashes is, for instance, the well-known CRC64, and

2. the block number is an 8-byte quantity,

then each entry in the hash table will be 16 bytes: eight for the CRC64 and eight for the block number.

If the block size is 512 bytes and we are populating a hash table for the one-terabyte FileX, then the hash table will need to be approximately 32 gigabytes in size.

This 32 gigabytes is computed as follows: in a terabyte file, there are approximately two billion 512-byte contiguous non-overlapping blocks. Each hash table entry is sixteen bytes long; hence a compact hash table would need to be 32 gigabytes in size.
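The arithmetic can be checked with a few lines of Python (a sketch whose constants simply restate the figures above):

    TERABYTE = 2**40      # size of FileX in bytes
    BLOCK_SIZE = 512      # bytes per block
    ENTRY_SIZE = 8 + 8    # 8-byte CRC64 digest + 8-byte block number

    blocks = TERABYTE // BLOCK_SIZE       # 2,147,483,648 blocks (~two billion)
    table_bytes = blocks * ENTRY_SIZE     # 34,359,738,368 bytes

    print(table_bytes / 2**30)            # 32.0 gigabytes for a compact table
    print(table_bytes / 0.30 / 2**30)     # ~106.7 gigabytes at the 30% load factor discussed below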

This 32 gigabytes of RAM for the hash table substantially exceeds the size of RAM that can be supported on a typical modern (circa June 2009) home computer.

As those who are familiar with the art know, hash tables become very inefficient when they get more than about 30% full. Thus, the real amount of RAM necessary to properly support FileX using 512-byte blocks would reasonably exceed 100 gigabytes. Even using 8192-byte blocks would still require over 6 gigabytes just for the hash table given the 30%-full capacity constraint.

V.(6) Deduplication Using the Current Art. Processing the Current File, Accessing the 32-Gigabyte Hash Table, and Bookkeeping

For each 512-byte block in FileY (i.e. the Current File), the hash code is calculated and the hash table created for FileX is searched for a matching hash value. If there is a match, then the underlying 512-byte block from FileX is read and compared to the 512-byte block currently being processed in FileY; and if the underlying 512-byte blocks in FileX and FileY (i.e. the Reference and Current Files, respectively) match, then a reference to the block in FileX is made and the duplicate data in FileY (i.e. the Current File) is noted by the bookkeeping done by the conventional method of the prior art and the new Method that we teach. If there is no match, then this is new data; and this new data is noted by the bookkeeping done by both the conventional method of the prior art and the new Method that we teach.
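The conventional loop just described can be sketched in Python as follows. This is an illustrative reconstruction rather than any practitioner's actual code: a dict stands in for the hash table, and CRC32 (from zlib) stands in for the CRC64 digest.

    import zlib

    BLOCK = 512

    def build_full_table(reference_path: str) -> dict:
        # Current art: one table entry for *every* 512-byte block of the Reference File.
        table = {}
        with open(reference_path, "rb") as f:
            pos = 0
            while (chunk := f.read(BLOCK)):
                table.setdefault(zlib.crc32(chunk), pos)  # digest -> byte position
                pos += len(chunk)
        return table

    def scan_current(current_path: str, reference_path: str, table: dict):
        # For each block of the Current File: hash, probe the table, then verify
        # against the raw Reference data to rule out false positives.
        with open(current_path, "rb") as cur, open(reference_path, "rb") as ref:
            pos = 0
            while (chunk := cur.read(BLOCK)):
                ref_pos = table.get(zlib.crc32(chunk))
                if ref_pos is not None:
                    ref.seek(ref_pos)
                    if ref.read(len(chunk)) == chunk:      # true positive: duplicate data
                        yield ("match", pos, ref_pos, len(chunk))
                        pos += len(chunk)
                        continue
                yield ("literal", pos, chunk)              # new data
                pos += len(chunk)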

No matter what bookkeeping is used to keep track of matching and non-matching data, we teach that there is—in our example, above—approximately a 100-gigabyte hash table.

V.(7) Deduplication Performance of the Preferred Embodiment of this Invention; Part I

Before we teach the new art of the Invention, it is useful to note that the Inventors have observed, in actual use of the implementation of the Preferred Embodiment of the Invention, that it is not unusual to remove 95% of the duplicate data between the Reference File and the Current File.

V.(8) Deduplication Using the New Art

100 gigabytes of RAM for a hash table is a large resource in order to support deduplication of FileX and FileY. The Inventors' Method substantially reduces this 100-gigabyte resource requirement.

The essential novelty of the Invention is that we teach that it is not necessary to store a hash table entry for each 512-byte block; but, instead, that the hash table can be oversubscribed in such a way that only a fraction of the hash table entries generated by the conventional method need appear in the hash table.

An astute reader would ask, “If one does not store every hash table entry, how can one possibly remove 95% of the duplicates as was taught in [0067] V.(7)?”

The answer is that there is extra information in the Reference and Current Files that is not being exploited by the current art. The Inventors' novel Method exploits this information.

The extra information that this Invention exploits, and that the current art does not, is that we examine local information (see [0038] V.(2)) for runs of identical data.

Let us make this clear with an extreme example.

We slightly modify the conditions we set up in [0049] V.(4) by assuming that the one-terabyte Reference File and the Current File are identical. That is, FileX and FileY are identical.

Further, assume that we limit the size of the hash table to a single entry and that the position selected for the hash entry is for position 549,755,813,888 (=2^39) in the Reference File; that is, roughly halfway into the Reference File but at a block position divisible by 512.

Note that for this example, any 512-byte-block position would work equally well.

We now initially perform exactly the same set of operations that we taught in [0064] V.(6) except that the size of the hash table is no longer 32 gigabytes but is, instead, only the size of a single table entry: 16 bytes.

We initially repeat the operations we taught in [0064] V.(6).

For each 512-byte block in FileY (i.e. the Current File), the hash code is calculated and the hash table (which now only contains a single entry) created for FileX is searched for a matching hash value. For purposes of this explanation, assume that there is no match of the hash values until we have processed 1,073,741,824 blocks in the Current File, thus reaching byte position 549,755,813,888.

At byte position 549,755,813,888 we now read a 512-byte block and we know that we will have a hash value match because, by definition, the Reference and Current Files are identical and identical data generates identical hash values.

We now read the 512-byte block at byte position 549,755,813,888 in the Reference File because that is the block of data that the hash table is pointing to for that hash value.

Again we know that the two blocks will compare equal, by definition of this example.

We now deviate from the current art. This is the “Aha!” moment.

At this point we would read backwards sequentially from position 549,755,813,888 in both the Reference and Current Files looking for a mismatch. Because the Reference and Current Files are identical, no mismatches would be found. Once beginning-of-file is detected (or, in the more realistic example, a mismatch is detected), the Method we teach would then seek to position 549,755,813,888+512 and read forwards until a mismatch was detected or end-of-file was detected. Of course, no mismatches would be detected.
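A minimal in-memory sketch of this bidirectional expansion follows (Python; the function and parameter names are ours, and a production implementation over terabyte files would of course use buffered seeks rather than bytes objects):

    def expand_run(cur: bytes, ref: bytes, cur_pos: int, ref_pos: int,
                   length: int, cur_wall: int = 0):
        """Widen a matching block at cur_pos/ref_pos in both directions."""
        lo_c, lo_r = cur_pos, ref_pos
        # Read backwards until a mismatch, beginning-of-file, or the Current File Wall.
        while lo_c > cur_wall and lo_r > 0 and cur[lo_c - 1] == ref[lo_r - 1]:
            lo_c -= 1
            lo_r -= 1
        hi_c, hi_r = cur_pos + length, ref_pos + length
        # Read forwards until a mismatch or end-of-file.
        while hi_c < len(cur) and hi_r < len(ref) and cur[hi_c] == ref[hi_r]:
            hi_c += 1
            hi_r += 1
        return lo_c, lo_r, hi_c - lo_c  # run start in each file, and the Expanded Run length

    # With identical data, a single mid-file probe expands to the whole file:
    data = bytes(range(256)) * 8
    assert expand_run(data, data, 1024, 1024, 512) == (0, 0, len(data))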

The bookkeeping that will be done in the Preferred Embodiment of the Inventors' Method would indicate that, starting at byte 0 in the Current File, there was a terabyte of matching information starting at byte 0 in the Reference File.

Thus, in the conditions we specified in [0049] V.(4), a terabyte of identical information would be reduced to a handful of bytes equivalent to the bookkeeping entry that stated that “starting at byte 0 in the Current File, there was a terabyte of matching information starting at byte 0 in the Reference File”.

A more typical example of a bookkeeping entry might look something like “At byte 614400 in the Current File there is a run of 307200 bytes that is identical to data that can be found at position 921600 in the Reference File.”

What makes this new Method efficacious is that the Inventors have noticed that, in the class of files of interest (e.g. backup files of large systems separated by some length of time, such as in our example of [0049] V.(4)), there are very long runs of identical data that happen to be in different places in the two files.

By “randomly” probing the Reference File with only a subset of hash values—say, one out of every 32 512-byte blocks—and using this backwards-and-forwards (or, equivalently, forwards-and-backwards) searching to find long runs of data, one can drastically reduce the number of hash table entries and still achieve 95% or better data deduplication.
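A sketch of such sparse probing, again in illustrative Python with a dict and CRC32 standing in for the hash table and CRC64:

    import zlib

    BLOCK = 512
    SAMPLE_EVERY = 32   # keep a digest for one block in 32: a 32:1 oversubscription

    def build_sparse_table(reference_path: str) -> dict:
        # Only every SAMPLE_EVERYth block contributes a table entry; a hit anywhere
        # inside a long run is later widened by the backwards-and-forwards search.
        table = {}
        with open(reference_path, "rb") as f:
            block_no = 0
            while (chunk := f.read(BLOCK)):
                if block_no % SAMPLE_EVERY == 0:
                    table[zlib.crc32(chunk)] = block_no * BLOCK
                block_no += 1
        return table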

V.(9) Definition of the Subscription Ratio

We define the “subscription ratio” of the hash table as the ratio of the number of blocks in the reference file whose hash code has been computed to the capacity of the hash table. For example, if the Reference File consists of 1000 blocks but the hash table has a capacity of 100 entries, then the oversubscription ratio would be 10:1, regardless of whether the hash table has one or more open entries after all the entries for the reference file have been processed.

To explicate this further by an extreme example of a degenerate case,

1. if the Reference File consists of 1000 blocks of all binary 1's, and

2. the hash table has a capacity of 100 entries,

then all the hash codes would be identical and would be indexed into the same hash table entry since all the hash codes would be calculated using blocks with identical contents.

Thus 99 out of 100 hash table entries would be unused. Nonetheless, the Subscription Ratio as we define it would still be 10:1 because the relative size of the number of blocks versus the limiting size of the hash table is 10:1.
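In code, the definition is a single division (a trivial illustrative sketch):

    def subscription_ratio(nn_blocks: int, table_capacity: int) -> float:
        # NN divided by the table's capacity; greater than 1.0 means oversubscribed.
        return nn_blocks / table_capacity

    print(subscription_ratio(1000, 100))  # 10.0, even if 99 of the 100 slots stay empty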

V.(10) Deduplication Performance of the Preferred Embodiment of this Invention; Part II

As those who are familiar with the art readily understand, one wants to minimize the size of the blocks for which hash values are computed while maximizing the number of entries in the hash table in order to get fine-grained probes of the Reference File. The larger the block size, the more likely that a run will not be detected. This will be further explained in the specification in Section [0244] XI.

The Method we teach produces high deduplication efficiency (often up to 20:1; higher deduplication performance will be achieved with highly similar data in the Current and Reference Files) with oversubscription ratios of 30 to 1 or more, allowing us to use small (e.g., 512-byte) blocks even for 200+GB files while consuming only 128 MB of RAM.

The Method we teach also scales well. Unlike the methods used by practitioners of the current art, our Method will continue to work gracefully, but with lesser deduplication efficiency, no matter how large the files become. That is, there is no precipitous drop-off in performance as the Reference or Current Files grow in size or the number of table entries in the hash table shrinks.

As for speed, circa June 2009, we can achieve input processing rates up to the 10-15 MB/sec range on inexpensive commodity computers with two SATA drives, depending of course on the amount of duplication in the files and how large the Reference File becomes. SSD drives provide roughly the same maximum speed, but the speed is not as dependent on the size of the Reference File.

V.(11) What about Insertions and Deletions?

An astute reader might ask, “This works well for data that is always identical at 512-byte boundaries. But what if there was an insertion or deletion of a byte at the very beginning of the Current File? That is, what if the Current File is identical to the Reference File except that the Current File is one byte longer than the Reference File because a byte has been added to the beginning of the Current File? Doesn't that remove the identicalness of the digests, so that it would be unlikely that any digest matches would be found and thus there would be almost no deduplication?”

The answer we teach is not in this Summary. The novel Method we teach, below, easily handles this insertion or deletion of a byte. See Sections [0348] XXI and [0526] XXIII (4), below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1: Use of L2 cache and the DED Access Accelerator (DAA)

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

VI. Definitions

Bit: On most computer architectures, individual bits do not have an address and it takes software to access a particular bit. Nonetheless, this is well understood by those familiar with the art.

Byte: Most modern computers organize information into eight-bit bytes. Bytes are usually addressable directly by hardware. The term is well understood by those familiar with the art.

Reference: Some means of identifying an object. Pointers, names, block numbers, byte offsets, etc., are various means of referring to data. The term is well understood by those familiar with the art.

Run: Two arrays of bytes that are identical. A Run can occur at two positions in the same file or in different files. One of the objectives of the Inventors' Method is to take Runs of bytes and eliminate the redundancy between them by replacing one of the arrays with a reference. The term is well understood by those familiar with the art.

Run Length (RL): The number of bytes that represents the length of the Run.

User Option: Those familiar with the art understand that there are innumerable ways for users to select how a program may operate. The selection may be as simple as clicking on a check box or as complicated as setting a physical switch somewhere across the world or in outer space. Sometimes a user will select an option such that the program will use one or more of a multitude of strategies and heuristics to modify its behavior. Often an option will be selected because the user knows more about the data coming in than the program does, and selecting the proper option can optimize the software's behavior. This term is well understood by those familiar with the art.

Compile-time Parameter: A value that is specified at compile-time. An example of a compile-time parameter might be the maximum length of a text string that the user can enter at run time. As those familiar with the art understand, many Compile-time Parameters are more-or-less easily converted to User Options. This term is well understood by those familiar with the art.

Deduplication: The removal of duplicate data by replacing one or more Runs with a reference to a single copy of the Run or a reference to an algorithm that can generate the Run. This term is well understood by those familiar with the art.

File: An ordered sequence of bytes generally known by a name and managed by an Operating System. This term is well understood by those familiar with the art.

Close: When performed on a file, a Close operation is the last operation performed on a file before Operating System resources are freed for that file. This term is well understood by those familiar with the art.

Chapter: We use a very similar definition of Chapter to that used in U.S. Pat. No. 6,374,266. We define Chapter to mean the collection of data necessary to logically or physically recreate the state of a file at a point in time. Chapters are stored in an Extended Reference File (ERF); the definition of an ERF can be found in this section, below. The concept of a Chapter is further explained in section [0556] XXXV.

Reference File: A base or old File for which there exists a newer or different version. While the authors have not coined this phrase (see, for instance, http://eurosys.org/blog/?p=15, incorporated by reference herein), the phrase is not well known by those familiar with the art. The word “reference” meaning “pointer,” as defined above, will be in lower case in this specification. The words “Reference File” will have their initial letters upper case in order to distinguish the two senses of the word “reference.”

Current File: A (typically newer) version of a Reference File. The terms “Reference File” and “Current File” are terms of convenience since the two files may be unrelated. For the method we teach to be efficacious (that is, to remove data redundancy), the sum of Run Lengths between the Current and Reference File should be a substantial fraction of the size of the Current or Reference File.

EOF: End Of File. This is a position representing the number of bytes in a file. More usually, it is a byte position in a byte-oriented file after the last byte of the file. If the file is 100 bytes in length, then EOF is 100 when using C-style zero-based file positioning. EOF is a term of art that is well understood by those familiar with the art.

Current File Pointer (CFP): A pointer to a position in the Current File. This pointer is used and updated as the Current File is scanned from beginning to EOF.

Output Packets: As the Current File is compared against the Extended Reference File (see the next definition), the Method will create some bookkeeping entries that indicate, for instance, where there is duplication of data, or an algorithm that can generate data. A typical example of the first kind of bookkeeping entry might look something like “At byte 614400 in the Current File there is a run of 307200 bytes that is identical to data that can be found at position 921600 in the Reference File.” An example of the second kind of bookkeeping entry might be “At byte 102400 in the Current File there is an array of 1057 bytes of zeros.” These bookkeeping entries are structured into Output Packets of various lengths. Those familiar with the art understand that there are many ways to structure the Output Packets.
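One of the many possible in-memory representations of these bookkeeping entries might look like the following Python sketch (the class and field names are illustrative assumptions, not a prescribed packet layout):

    from dataclasses import dataclass

    @dataclass
    class MatchingDataPacket:
        """A run in the Current File duplicating data in the (Extended) Reference File."""
        cur_pos: int   # where the run starts in the Current File
        ref_pos: int   # where the identical data lives in the Reference File
        length: int    # run length in bytes

    @dataclass
    class AlgorithmicDataPacket:
        """A run that an algorithm can regenerate; here, a run of identical fill bytes."""
        cur_pos: int
        length: int
        fill: int = 0

    # The two examples from the text:
    mdp = MatchingDataPacket(cur_pos=614400, ref_pos=921600, length=307200)
    adp = AlgorithmicDataPacket(cur_pos=102400, length=1057, fill=0)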

Extended Reference File (ERF): In the Preferred Embodiment, the Output Packets are appended to the Reference File. This need not be done. As those who are familiar with the art understand, the Output Packets could be stored nearly anywhere: in a separate file or files, on another computer in a network, or even stored somewhere in so-called cloud storage. If Output Packets are appended to the Reference File, as is done in the Preferred Embodiment, then ERF is a term of convenience since once the Output Packets have finished being appended to the Reference File, the Extended Reference File becomes the Reference File for the next Current File. Every time that a Current File is deduplicated and a set of Output Packets representing a new Current File is added to an Extended Reference File, a new Chapter is created. One of the many possible structures of an ERF can be found in Section [0402] XXVII. The Inventors have coined this definition.

Adding a file: There may be many Current Files that are processed one after another to create many Chapters in an Extended Reference File. We define this process of adding Output Packets to the Extended Reference File as “Adding a file.” This process is similar to, say, adding a file to a zip file.

Reconstituting a Current File: There may be many Current Files that are “added” to the Extended Reference File. Reconstituting the file means rebuilding the Current File of interest from the data within the Extended Reference File. This process of reconstituting a file is similar to extracting a file from a zip file.

Repository: A place where data is stored in, generally, a non-volatile manner. Typical Repositories are disk drives, tape drives, memory sticks, etc. When we use the term Repository we generally mean disk drive, although the Method applies to other implementations of Repositories.

Intrafile Compression: The repeated use of Local Information to reduce the size of a file. A common example of this is the well-known zip file algorithm. This kind of compression is well understood by those familiar with the art.

Intrafile Duplication: If a single file has two different positions in the same file at which there is a Run, then we say that there is Intrafile Duplication.

Interfile Duplication: If two different files share one or more Runs, then we say that there is Interfile Duplication.

Intrafile Deduplication: The process of removing some or all duplicate data within a file by analyzing the entire file for duplicate data. Our Method is one of many methods to perform Intrafile Deduplication.

Delta Compression: A collection of techniques well known to those familiar with the art that creates a file (a so-called Patch File, Difference File, or Delta File; three synonyms for the same thing) that, when appropriately combined with a Reference File, can recreate a Current File. This term is well understood by those familiar with the art.

Patch File: A file that, when appropriately combined with some other file, creates a third file. See Delta Compression, immediately above. This term is well understood by those familiar with the art.

Image Backup: A file that represents the state of a computer's Repository (or portion thereof) at a moment in time. Typically this is a sector-by-sector (i.e. a bit-by-bit) copy of the computer's hard disk. This term is well understood by those familiar with the art.

Forensic Image Backup: A backup in which an entire source Repository is copied bit-by-bit to a backup repository so that the state of the entire source Repository is captured and preserved. This term is well understood by those familiar with the art.

Allocated Image Backup: A file, or files, or (backup) Repository representing the state of a computer's hard disk at a moment in time. Typically this is a sector-by-sector (i.e. a bit-by-bit) copy of the computer's hard disk in which the backup program knows which sectors have been allocated by the various Operating Systems using the computer's Current Repository (e.g. disk drive) and only those sectors are copied to the target Repository. This term is well understood by those familiar with the art.

Current Repository: A place where data is stored in a nonvolatile manner that is generally accessible to a computer's current operations. This is a term of logical convenience.

Backup Repository: A place where backup data is stored in a nonvolatile manner that is generally logically inaccessible to a computer's current operation. This is a term of logical convenience since some users of computers can keep their backup data permanently accessible and modifiable.

Operating System: A computer program that, typically, provides an interface between general software and the hardware on which the general programs run. Examples are Microsoft Windows, Linux, Mac OS, and Unix. This concept is well understood by those familiar with the art.

Reversible Digest: A piece of data or an algorithm that can uniquely recover the original data from which it came. A non-exhaustive set of examples of reversible digests would be (1) an algorithm to generate fifteen bytes of zeros in a row; (2) an LZW compression of 512 bytes; (3) the original data itself; (4) reversing the bytes of an array. For a fuller explanation of the term, see Section [0369] XXIII. The Inventors have coined this definition.

Digest: A possibly small piece of data that represents a larger piece of data. Often the digest is a hash code (e.g. CRC, MD5) of a larger piece of data. “Digest” and “hash code” are synonyms and are often used interchangeably. This term is well understood by those familiar with the art although, for the purposes of this specification, the Inventors use the phrase slightly differently from the generally accepted meaning because we use the term to mean either a hash code or a reversible digest.

Digest Entry Data (DED): A data structure comprising a digest (e.g. a hash) plus other data necessary and/or convenient to identify where the raw data came from. In the Preferred Embodiment, “where the data came from” is a byte offset in the Extended Reference File (ERF). For convenience and computational efficiency, the DED may contain more than a single digest (as well as other data). In the Preferred Embodiment, the DED contains a computationally expensive CRC64 as well as a computationally inexpensive “Rolling Hash.” In a hash table, the DED would be a hash table entry. In the Preferred Embodiment, the DED is a hash table entry. For those familiar with the art, the fields of the DED could be spread out among multiple tables. For instance, if the DED comprises the following three fields: (1) a CRC64, (2) a Rolling Hash, and (3) a position in the ERF, then (except for performance) it is computationally equivalent to place these three fields into, say, separate arrays. In whatever manner these fields are stored, the collection of fields is a DED. The Inventors have coined this definition although the concept is well understood by those familiar with the art.
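A sketch of a DED as a record, in illustrative Python (the field names are ours):

    from dataclasses import dataclass

    @dataclass
    class DED:
        crc64: int         # computationally expensive "Good Hash" of the n-byte-block
        rolling_hash: int  # computationally inexpensive recurrence-based hash of the same block
        erf_offset: int    # byte offset in the Extended Reference File where the block lives

As noted above, storing these three fields in three parallel arrays instead would be computationally equivalent except for performance.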

Matching-DED: A Digest Entry Data (DED) may have one or more Digests as fields. For instance, in the Preferred Embodiment, the DED has two Digests: a CRC64 and a Rolling Hash. If a particular DED has a Digest that matches a particular value, then we say that the DED is a Matching-DED. Note that the match is on the Digest values and not necessarily on the underlying data in any file.

Key: A key is a field in a database that a program can use to select records; in our case, the records are DEDs. This concept is well understood by those familiar with the art.

Hash Index: A transformation of a hash code value that becomes an index. For instance, assume that a hash table has 100 entries and that the hash being used is a CRC16. CRC16 will generate a hash code value between 0 and 65535. One simple way to convert the hash code from 0-65535 to a range of 0-99 is to simply take the CRC16 value modulus 100. As another example, assuming that the hash table has 256 elements, all one needs to do is use the bottom eight bits of the hash code value to provide a convenient index into the table; indeed, any eight bits of the hash code value would serve.
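Both transformations are one-liners (illustrative Python; the function names are ours):

    def hash_index_mod(hash_value: int, table_size: int) -> int:
        # General case: reduce any hash value to an index via modulus.
        return hash_value % table_size

    def hash_index_mask(hash_value: int) -> int:
        # 256-entry table: any eight bits will do; here, the bottom eight.
        return hash_value & 0xFF

    assert hash_index_mod(65535, 100) == 35
    assert hash_index_mask(0x1234) == 0x34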

Digest Data Structure (DDS): A data structure (typically, a hash table) that contains a collection of DEDs collected from an Extended Reference File (ERF) (see definition of ERF, above). In general, the DDS is a data structure that is stored separately from the ERF. In the Preferred Embodiment, the DDS is stored as a separate file. Indeed, if needed, the DDS can be recreated from a Reference File. The Inventors have coined this definition.

Digest Data Structure Hash Count (DDSHC): The number of DEDs that a DDS is currently storing. The Inventors have coined this definition.

Digest Data Structure Maximum Hash Count (DDSMHC): The number of hash values that a DDS could theoretically store given the amount of RAM and/or other resources available. If the DDS is a hash table, then the DDSMHC is the number of entries in the hash table array. The Inventors have coined this definition.

Database Mapping Engine (DBME): A generic name for any algorithm that will store and/or fetch a key or keys and a value or values. Keys and values are well-understood terms for those familiar with the art. Often, it is conceptually easier to think of giving the DBME a key (e.g. a CRC64) and letting the DBME convert the key from a hash value to a hash index. The Inventors have coined this definition.

n-byte-block: In the Preferred Embodiment, an n-byte-block is a contiguous block of RAM memory or disk storage of 512 bytes. In the current Preferred Embodiment, the size of the n-byte-block is a Compile-time Parameter. Those familiar with the art understand that this Compile-time Parameter size of 512 bytes could, more or less, be easily converted into a User Option. The n-byte-block has no particular alignment. The Inventors have coined this definition.

N-bytes: The number of bytes in an n-byte-block. In the Preferred Embodiment this is 512. The Inventors have coined this definition.

variable-length-n-byte-block (VLNBB): Similar to an n-byte-block but where the number of bytes is not of fixed length. The implementation of variable-length-n-byte-blocks in the Method is explained in section [0641] XLII. The Inventors have coined this definition. The concept should be well understood by those familiar with the art.

variable-length-n-byte-block-length-table (VLNBBLT): This is a table of unique lengths of variable-length-n-byte-blocks (VLNBBs). For example, if there are 100 VLNBBs computed for a Reference File and, of those 100, 99 have a length of 512 bytes and one has a length of 8K bytes, then the variable-length-n-byte-block-length-table will have two entries: (1) a value of 512, and (2) a value of 8K. The Inventors have coined this definition.

N-variable-length-n-byte-block-length-table (NVLNBBLT): This is the number of entries in the variable-length-n-byte-block-length-table (VLNBBLT). For example, if there are 100 VLNBBs computed for a Reference File and, of those 100, 99 have a length of 512 bytes and one has a length of 8K bytes, then the value of N-variable-length-n-byte-block-length-table will be 2. The Inventors have coined this definition.

NN: The (possibly non-integer) number of contiguous non-overlapping n-byte-blocks in a Current and/or Reference File (as appropriate in the context of the discussion). Where NN is non-integer, the reader should round this value up or down to an integer—as appropriate or convenient—if the context of the discussion warrants it. The Inventors have coined this definition.

B-byte-block: In the Preferred Embodiment, a B-byte-block is a contiguous block of RAM memory of 256K bytes. The B-byte-block has no particular alignment and can be larger or smaller than 256K. In the Preferred Embodiment, the size is a Compile-Time Parameter, but those familiar with the art recognize that this value may be dynamically selected depending on available system resources such as available RAM, or the size of the B-byte-block could be set as a User Option. Additionally, the size may be computed dynamically so as to maximize system I/O throughput.

BN: Size, in bytes, of the B-byte-block.

BB: The (possibly non-integer) number of B-byte-blocks in a Current and/or Reference File (as appropriate in the context of the discussion). Where BB is non-integer, the reader should round this value up or down—as appropriate or convenient—to an integer if the context of the discussion warrants it. The Inventors have coined this definition.

Subscription Ratio (SR): The ratio of the number of n-byte-blocks in a Current and/or Reference File (as appropriate in the context of the discussion) to the DDSMHC. That is, NN divided by DDSMHC. A fuller explanation of the SR can be found at [0092] V.(9). The Inventors have coined this definition.

Oversubscription: A Subscription Ratio (SR) greater than 1.0.

Undersubscription: An SR less than 1.0.

Plug-in: A piece of computer code that can be provided by the user to perform operations generally too complex to be provided as a User Option. For instance, the well-known web browser Firefox allows third parties to provide plug-ins in order to extend the capability of the browser. The term is well understood by those familiar with the art.

Reference File DED Storage Strategy (RFDSS): Because the DDS is oversubscribed, it is quite likely that there will be more hash code values than there is room in the DDS to store all the DEDs associated with the hash code indices; there needs to be some strategy to select which DEDs to insert and which to ignore. Perhaps the simplest strategy is to use the Subscription Ratio (SR), rounded to the nearest integer, such that for every SRth n-byte-block being emitted to the ERF there would be a DED generated and that DED inserted into the DDS. Another possibility is to allow the user to write a plug-in in order to select which DEDs will be inserted into the DDS. The Inventors have coined this definition.
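The simplest strategy described above amounts to the following test (an illustrative Python sketch; the function name is ours):

    def should_store_ded(block_index: int, subscription_ratio: float) -> bool:
        # Keep a DED only for every SRth n-byte-block emitted to the ERF.
        k = max(1, round(subscription_ratio))
        return block_index % k == 0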

Extended Reference File DED Storage Overwrite Strategy (ERFDSOS): Because the DDS is oversubscribed, as new Output Packets are generated and new n-byte-blocks of data are logically appended to the Extended Reference File (ERF), a strategy needs to be used to determine when to overwrite older DEDs. There are many strategies that can be used to determine when to replace older elements in the DDS with newer DEDs representing newly created Output Packets. Several strategies will be explained in [0531] XXXIII.(5). Perhaps the simplest strategy is to use the Subscription Ratio (SR), rounded to the nearest integer, such that for every SRth n-byte-block being emitted to the ERF there would be a DED generated and that DED inserted into the DDS. Another simple strategy would be to have a user option specify an integer value, K, that would be used instead of SR. Note that the ERFDSOS can be the same as or different from the RFDSS. Another possibility is to allow the user to write a plug-in in order to select which DEDs will be inserted into the DDS. The distinction between the ERFDSOS and the RFDSS is minor: the RFDSS is used when the DDS is initially empty, and the ERFDSOS is used when there is an existing DDS that was previously created and is likely full of DEDs. The Inventors have coined this definition.

Natural Language: A human language such as English, French, or Japanese.

Computer Language: A few examples are C++, FORTRAN, COBOL, C#, PL/I. Often, computer languages are well defined by ANSI and/or ISO standards. This term is well understood by those familiar with the art.

Flow Chart: A term familiar to those familiar with the art of computer programming. It is a schematic representation of a program and/or algorithm. This term is well understood by those familiar with the art.

Edge Condition: A term familiar to those familiar with the art of computer programming. It is a condition that occurs “relatively rarely” while a program is running and represents a complication that must be handled. Some of the many possible Edge Conditions are: (a) end-of-file, (b) zero elements in an array, (c) one element in an array, (d) end-of-program cleanup, (e) I/O errors. Edge conditions can introduce a great deal of complexity to computer programs. When explaining the essence of an algorithm, edge conditions are often left out in order to make the understanding of the algorithm easier. This term is well understood by those familiar with the art.

Pseudocode: A term familiar to those familiar with the art of computer programming. It is a textual representation of a program and/or algorithm that loosely uses the (typically structural) conventions of many programming languages but omits many of the details of a specific computer language's syntax, generally being written in a Natural Language. It is often easier for those familiar with the art to understand an algorithm presented using Pseudocode rather than either a detailed computer program written in a Computer Language or a Flow Chart. Typically, Edge Conditions are ignored when writing Pseudocode whose purpose is to teach the central concepts of an algorithm or a Method. This term is well understood by those familiar with the art.

Page: A section of computer memory that may be in RAM or on disk. This term is well understood by those familiar with the art.

Page Fault: If a program requests a Page of memory that happens not to be in high-speed RAM, then a Page Fault occurs. Typically, the operating system will attempt to bring the section of memory in from slower-but-cheaper non-RAM (typically, disk-based) storage. On disk-based devices this usually involves a time-expensive seek operation followed by a read operation. This term is well understood by those familiar with the art.

Working Set: The set of all Pages (in a paging virtual memory system) used by a process during some time interval. This term is well understood by those familiar with the art.

Rolling Hash: A hash code that can be calculated as a recurrence; that is, as a function of an existing hash code and a few other data items. This contrasts with a non-Rolling Hash, which must be recalculated from scratch for any change in the data. This term is well understood by those familiar with the art.
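By way of illustration, here is one well-known recurrence of this kind: a polynomial rolling hash of the sort used in Rabin-Karp string searching. The parameters are arbitrary illustrative choices, and this is not necessarily the Rolling Hash of the Preferred Embodiment.

    BASE, MOD = 257, (1 << 61) - 1  # illustrative parameters

    def roll_init(window: bytes) -> int:
        # Full recomputation: needed only once, for the first window.
        h = 0
        for b in window:
            h = (h * BASE + b) % MOD
        return h

    def roll_step(h: int, outgoing: int, incoming: int, pow_n: int) -> int:
        # Slide the window one byte in O(1): drop `outgoing`, append `incoming`.
        return ((h - outgoing * pow_n) * BASE + incoming) % MOD

    data = b"abcd"
    n = 3
    pow_n = pow(BASE, n - 1, MOD)             # BASE**(n-1) mod MOD
    h = roll_init(data[0:n])                  # hash of b"abc"
    h = roll_step(h, data[0], data[3], pow_n)
    assert h == roll_init(data[1:4])          # now the hash of b"bcd", without rescanning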

Good Hash: A Good Hash has certain desirable statistical properties having to do with the pseudo-randomness of the hash value as compared to the input. The current technology is such that a Good Hash is computationally expensive compared to a Rolling Hash. Examples of Good Hashes are CRC64, MD5, SHA, etc. The desirable properties of a Good Hash are well understood by those familiar with the art.

False positive: If one uses a digest (e.g. a hash) with a smaller number of bits than the underlying data, then it is guaranteed that under some circumstances the digests will match but the underlying data will not. This situation is called a false positive.

True positive: The opposite of a false positive; the underlying data matches. If one uses a digest (e.g. a hash) with a smaller number of bits than the underlying data, then it is often true that the digests will match and the underlying data also matches.

Red-black tree: A data structure in which searches, insertions, and deletions are done in O(log n) time. This term is well understood by those familiar with the art.

Decoration: A block of data of interest is decorated when more data is added to the block in order to demarcate the block. Decorating a block is conceptually similar to placing a letter in an envelope. This term is well understood by those familiar with the art.

Packet: A block of data, generally transferred from one place to another. This term is well understood by those familiar with the art.

Matching Data Packet (MDP): A packet emitted to the Extended Reference File (ERF) representing a string of bytes for which data is being deduplicated. In an MDP there will be a reference to a position in the ERF as well as a length. The Inventors have coined this definition.

Literal Data Packet (LDP): A packet emitted to the Extended Reference File (ERF) representing an array of bytes for which no matching data has been detected by the Method. The Inventors have coined this definition.

Zero Data Packet (ZDP): A packet emitted to the Extended Reference File (ERF) representing an array of bytes that have zeros as the contents of the bytes. The Inventors have coined this definition.

Algorithmic Data Packet (ADP): A packet emitted to the Extended Reference File (ERF) representing an array of bytes in which some algorithm is used to generate the data in the array of said bytes. A ZDP is a trivial example of an Algorithmic Data Packet. The Inventors have coined this definition.

Chapter Packet: A special packet that marks the end of a Chapter and is the last packet written before the Chapter is closed and the File is closed. Chapter Packets in the Extended Reference File are chained together so that one can find a preceding Chapter Packet from any Chapter Packet except the first one. The first Chapter Packet has a special “no preceding Chapter Packet” flag. The Inventors have coined this definition.

Reference File Analysis Phase (RFAP): The first of two major phases of operation of the Method we teach. In this phase of operation, the Method builds the DDS from data in the Reference File. The Inventors have coined this definition.

Current File Redundancy Elimination Phase (CFREP): The second of two major phases of operation of the Method. In this phase of operation, the Current File is examined and—using the DDS and the Extended Reference File—data redundancy is eliminated between the Reference File and the Current File. The Inventors have coined this definition.

Write Once Read Many (WORM): A device that has the physical characteristic that data written to media for this device cannot be erased. This term is well understood by those familiar with the art.

Write Fence: A position in a file or on a logically linear storage device that enforces a restriction on writing to the file or device prior to a certain point. A WORM device must know where the Write Fence is for a particular piece of medium. The sense of a Write Fence, though, is that some “meta” operation could reset the position of the Write Fence on, say, a disk drive simulating a WORM device, thus allowing the rewriteable media in the disk drive to be reused.

Maximal Run Length (XRL): A Run, as measured in bits or bytes as appropriate in context, of maximum possible length. The authors have coined this phrase.

Submaximal Run Length (SXRL): A Run, as measured in bits or bytes as appropriate in context, of less than or equal to the maximum possible length. The authors have coined this phrase.

Minimum Run Length (MRL): Since emitting a Run to the Extended Reference File may require decoration, the Minimum Run Length is a value in bits or bytes (as appropriate in context) that the Method may use to limit the emitting of Runs of a short Run Length in order to reduce the size of the Extended Reference File. The Minimum Run Length may be one bit or one byte as appropriate in context. The authors have coined this phrase.

First In First Out Queue (FIFO): This term is well understood by those familiar with the art.

Last In First Out Queue (LIFO): This term is well understood by those familiar with the art.

Temporary Literal Buffer (TLB): A buffer where literal data is accumulated. As the Current File is being processed sequentially, the Method will detect arrays of bytes that do not appear in the Extended Reference File. These bytes are accumulated in a Temporary Literal Buffer until matching data is detected. Among the ways a Temporary Literal Buffer could be implemented is as a LIFO or FIFO queue.

Current File Wall (CFW): A position in the Current File. As bytes are emitted to the Temporary Literal Buffer (TLB), or a Matching Data Packet (MDP) is emitted, or a Literal Data Packet (LDP) is emitted, or an Algorithmic Data Packet (ADP) is emitted, the Current File Wall is updated to a higher position in the Current File representing the highest position in the Current File emitted.

Expanding-The-Run: If we detect a short Run (say, of 512 bytes), then we wish to look at both sides of the Run to see if bytes continue to match. On the low-order side of the Run, the expansion stops at the Current File Wall (CFW) or at the maximum count that the Run Length (RL) can maintain. At the high-order side of the Run, the expansion stops at EOF or at the maximum value that the RL can maintain. As those familiar with the art understand, the order in which one searches for a mismatch of bytes on either side is irrelevant. The concept is further explained in [0354] XXI.1. The Inventors have coined this phrase.

Expanded Run: The result of Expanding-The-Run. It is not necessary for the length of the Expanded Run to be the maximum possible Run Length (RL). The Inventors have coined this phrase.

Insertion/Deletion Algorithm: An algorithm that is able to look at two binary strings and detect and report insertions and deletions in one in order to produce the other. Such algorithms are well known to those familiar with the art. For instance, as we write this, http://code.google.com/p/xdelta/ (incorporated herein by reference) is a link to the open source for a program called xdelta that performs binary differences.

Magic Number: This is a bit pattern that is used to indicate or verify that correct data has been placed in a particular location. Generally, a Magic Number is a bit pattern that is expected rarely to be seen. This term is well understood by those familiar with the art.

Locality of Reference: A term of art that, roughly, means that computers can process data faster when the data to be processed is relatively compact along some relevant dimension. Thus, on a disk drive with a single head, data that is on a single track can be fetched faster than data on nearby tracks, which, in turn, can be processed faster than data on tracks that are farther away. Similarly, data in RAM that is mirrored in an L3, L2, or L1 hardware cache will be processed far faster than data that is not in these caches.

RAM Memory Hardware Cache (RMHC): In modern (circa 2009) computers, there is a small amount of very fast memory that the CPU can access. The L1 cache is, generally, physically on the CPU chip itself, and tends to be quite small, on the close order of 64K. The L2 cache (on the close order of 2 megabytes) tends to be intermediate in speed between RAM and the L1 cache.

DED Access Accelerator (DAA): This is a data structure that copies a portion of the Digest stored in each DED into a compact array that is small enough to fit into a cache (e.g. an L2 cache) that is faster than main RAM memory, in order to increase the performance of hash table lookups. This is more fully explained in [0663] XLIII. The Inventors have coined this phrase.

Proxy: When used in the context of the DED Access Accelerator (DAA), in the Preferred Embodiment, the one-byte Proxy is the low-order byte of the Rolling Hash stored in the DED. It could, of course, be some other collection of bits of the Rolling Hash, but a single byte is convenient. A nybble (4 bits) is another obvious size for a Proxy. The DAA is an array of Proxies.
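To make the idea concrete, the following C sketch shows how such a Proxy probe might work. The names (daa, deds, TABLE_SIZE) and the slot-selection scheme are illustrative assumptions, not the Inventors' implementation:

  #include <stdint.h>

  #define TABLE_SIZE (1u << 20)         /* number of DED slots (illustrative) */

  struct DED {
      uint32_t LowerHashValue;
      uint32_t BlockNumberInReferenceFile;
      uint64_t GoodHashValue;
  };

  extern unsigned char daa[TABLE_SIZE]; /* compact array of one-byte Proxies */
  extern struct DED deds[TABLE_SIZE];   /* the full, RAM-resident DED table  */

  /* Probe the DAA before touching the full DED table. The slot index is
     taken from bits above the Proxy byte so the two do not overlap. Most
     non-matching probes are rejected using only the small, cache-resident
     daa[] array; main RAM is touched only on a Proxy hit. */
  static struct DED *probe(uint32_t rolling_hash)
  {
      uint32_t slot = (rolling_hash >> 8) % TABLE_SIZE;
      unsigned char proxy = (unsigned char)(rolling_hash & 0xFF);

      if (daa[slot] != proxy)
          return 0;          /* cheap rejection, likely an L2-cache hit */
      return &deds[slot];    /* candidate found; verify against the DED */
  }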

Synchronizer: A program that performs deduplication.

VII. How Data Redundancy Elimination Between Two Files is Currently Done

For small (on the order of 600 megabytes or smaller) datasets, a common technique to find the “difference” between two files is to use a “binary difference” program (e.g. Visual Patch from Indigo Rose Software). Techniques of this kind will generally eliminate redundancies at close to theoretical maximums.

Above that range, cruder commonality-removing (i.e. deduplication) techniques are used.

In general, to calculate a binary difference, the two input streams are logically broken up into logically contiguous, equally sized blocks of data.

A digest (i.e. hash) is computed for each n-byte block in one of the files. If the hash is a so-called MD5 hash, the hash itself occupies 16 bytes; with the bookkeeping stored alongside it (such as the block's position), each entry can easily occupy 32 bytes.

If the user has a gigabyte of RAM available—which is fairly typical for today's home computers—the user will be able to store roughly 30 million such entries. That is, the DDSMHC will be 30 million.

If N-bytes is 512 bytes, then a DDS that is one gigabyte in size will be able to represent an input file of approximately 16 gigabytes in size. This is far smaller than the typical modern 2009 backup file size.

Once the hashes are calculated for each block in one of the files, the other file is examined and hashes are calculated for the blocks in that file. A matching hash indicates common data between the two files.

One could increase N-bytes, but that reduces the likelihood that a match will be found. Similarly, one could reduce the size of the digest from 16 bytes to, say, a 4-byte CRC, but this has the undesirable effect of increasing the number of false positives.

VIII. Prior Art

Parts of this invention build on the technology of Shnelvar's patent U.S. Pat. No. 6,374,266. We make explicit note of U.S. Pat. No. 7,055,008 and state that while this invention may use stochastic (or, in the alternative, heuristic) processes to determine if a block of data “is worthy of investigation to determine if the blocks are identical”, this invention does not use the mechanisms taught in 7,055,008. Indeed, for those devices with long seek times, we teach the use of stochastic processes on a collection of hash codes for blocks to determine whether the computational expense of the delay associated with seeks makes those blocks worth fetching. Unlike 7,055,008, we do not assume that if two hash codes are identical then the underlying blocks are identical.

While this invention focuses on computer backups, it has applicability to the reduction in size of any two datasets as well as eliminating duplication within a dataset (Intrafile Deduplication). An objective of this Invention is not to require the reverse engineering of the internal structure of the backup files of backup software; or, indeed, of any software.

In order to understand the operation of the Invention, one should understand a point made at http://cis.poly.edu/.about.suel/papers/delta.pdf (Algorithms for Delta Compression and Remote File Synchronization). Specifically, “delta compression” and “remote file synchronization” are quite different but related problems.

The present Invention relates to delta compression, but it can also be used for remote file synchronization. Indeed, it is one of the goals of this Invention that it be used for Remote File Synchronization.

To paraphrase what Suel and Memon wrote in Algorithms for Delta Compression and Remote File Synchronization:

In the delta compression problem, a Client Computer has a copy of a Reference File and a Server Computer has copies of both the Reference and Current Files, and the goal is to compute a Delta File of minimum size, such that the Client Computer can construct a Current File by manipulating the Reference File and the Delta File.

“In the remote file synchronization problem, the Client Computer has a copy of the Reference File and the Server only has a copy of the Current File and the goal is to design a protocol between the two parties that results in the Client Computer holding a copy of the Current File while minimizing the communication cost.”

In the present invention it could be said that we turn some of Suel and Memon's definitions on their heads. (In actuality, we don't do that, since which computer is the Client and which is the Server is a matter of convenience.) Specifically, in the class of problem that this invention addresses, we assume that the Client Computer contains both the Reference File as well as the Current File, and the objective is to create a series of patch file(s) that represent the various states of the Current File at various previous times.

Given that there may be many clients being served by a server and that the computation of the Delta File will take on the order of several hours, it becomes clear that it would be wiser—for several reasons—to use the idle capacity of the clients rather than the servers to compute the Delta File(s).

There are other advantages to computing the Delta File on the Client rather than the Server. A non-obvious advantage is that the client could then select a different imaging program at different times, thus providing a contingency against the possibility that a particular program is unable to restore the image it had created.

IX. A Typical Example

Consider a collection of files that represent daily Image Files of a particular client computer. For instance, consider the following, possibly typical, scenario of five daily Image Files that represent the state of a Client Computer on Monday through Friday.

On Monday a Monday Reference File is created.

On Tuesday the Tuesday Current File is created. The Client Computer then uses (possibly idle) system time to create a Tuesday Patch File that allows the Tuesday Current File to be created given the Monday Reference File and the Tuesday Patch File. A Verification Pass is then optionally performed to guarantee that the Tuesday Current File can be properly recreated from the Monday Reference File and the Tuesday Patch File. Then the Tuesday Current File is deleted in order to save disk space. The Tuesday Patch File is appended to the Monday Reference File to create a Tuesday Reference File.

In the Preferred Embodiment of our Method, the combination of the Monday Reference File and the Tuesday Patch File is called the Extended Reference File. We call the contents of the Monday Reference File “Chapter 1”. We call the contents of the Tuesday Current File as represented by the Monday Reference File plus the Tuesday Patch File “Chapter 2”, etc.

In the Preferred Embodiment of the Invention, there is no separate Patch File. Instead, Output Packets are appended to a Reference File. As those familiar with the art understand, splitting out, say, the Tuesday Output Packets to form a Patch File is a trivial operation, since the location of the Chapter's beginning position is stored in the last packet (the Chapter Packet) in the Extended Reference File.

On Wednesday a Wednesday Image File is created. The Client Computer then uses idle system time to create a Wednesday Patch File that allows the Wednesday Current File to be created given the Tuesday Reference File and the Wednesday Patch File. A Verification Pass is then performed to guarantee that the Wednesday Current File can be properly recreated from the Tuesday Reference File and the Wednesday Patch File. Then the Wednesday Current File is deleted in order to save disk space. The Wednesday Patch File is appended to the Tuesday Reference File to create a Wednesday Reference File.

The same process applies to Image Files created on Thursday and Friday, and to those created every day (or hour or minute or second) after that.

As those familiar with the art understand, whether the Patch Files are appended to the (Monday) Reference File or whether these files are saved individually is a trivial implementation detail. Whether the Patch Files are created using system idle time is an implementation detail. Whether the Current Files are deleted is an implementation detail. Whether a Verification Pass is performed is an implementation detail.

Indeed, the Method of the current invention is not limited to image files but can easily be used for any collection of (large) databases and/or files.

The user of the present invention might find that it is efficacious to run backup program X on Monday and program Y on Tuesday so that the user is not relying on the ability of a single program to backup and restore the user's data. This is a major feature of the Method.

As users and administrators have learned from bitter experience, “Backup is easy. Restoring is hard.”

X. Seek and Ye Shall Find it Slowly

As those familiar with the art understand, random accesses in RAM are much faster than random-access seeks followed by a read on disk.

Even with RAM, so-called “locality of reference” is a desired feature. Most modern CPUs will keep recently used data in a high-speed cache. As one of the authors of this Invention has noted in his book, Efficient C/C++ Programming, exploiting this feature of a CPU can dramatically increase the speed of operation of a computer program.

For purposes of simplicity, in this section we shall ignore the various hardware caches that most modern microprocessors possess and speak in broad terms.

Hard disk seek times, circa 2009, are measured on the order of milliseconds. 8 milliseconds is a fairly typical seek time, and this time has not changed for several years. See, for instance, http://www.storagereview.com/articles/200601/WD1500ADFD_3.html, which is incorporated herein by reference. Assume for this discussion that there is a technological breakthrough and that a fast hard disk is capable of random access seeks that take “only” a millisecond.

Random access seeks in RAM are measured in nanoseconds. A fairly slow modern computer can do a non-localized seek in about 70 nanoseconds. Assume 100 nanoseconds for simplicity.

Thus, the ratio of the random-access seek time of our very fast (currently theoretical) hard disk to that of our relatively slow RAM is 10⁻³/(100×10⁻⁹) = 10⁴, which is ten thousand to one.

A more reasonable modern figure would be about a hundred thousand to one.

As those familiar with the art understand, if the comparison is between the situation where the disk drive's head is near the center of the drive and the data to be sought is towards the outer edge, and the situation where the data is in a high-speed RAM cache because of locality of reference, then the ratio can be well in excess of a million to one.

In either event, the lesson that we, and many others, teach is that seeking on disk is to be avoided if at all possible when there are a large number of seeks to be performed, because of the large cost in the time domain.

XI. N-Bytes

Smaller is Usually Better

One can view each n-byte-block as a probe into the Reference and Current Files. In order to find the maximum number of Runs, one wants to make N-bytes as small as possible.

To demonstrate this, assume that the Reference and Current Files have the following contents:

Exhibit 1: Two files in which only the first byte is different

  Offset in File:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  Reference:       A B C D E F G H I J K  L  M  N  O
  Current:         X B C D E F G H I J K  L  M  N  O

XI.(1): In Our First Scenario, Assume that N-Bytes is 5

When N-bytes is 5, then:

bytes 0-4 do not match and the two hash codes for bytes 0-4 are not likely to match.

bytes 5-9 do match and the hash codes for bytes 5-9 will match.

bytes 10-14 do match and the hash codes for bytes 10-14 will match.

Under this circumstance, both the Inventors' Method as well as the methods used by others knowledgeable in the art will detect two matches of five bytes each and would easily be able to eliminate ten bytes of redundancy.

XI.(2): In Our Second Scenario, Assume that N-Bytes is 15

Because a hash would be computed on two different 15-byte arrays, it is highly unlikely that the two hash codes will match. Even if by some extreme statistical improbability the two hash codes did match, the underlying fetch of the two 15-byte arrays would show an immediate mismatch.

With N-bytes equaling 15, no redundant data would be detected.
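The two scenarios can be reproduced with a few lines of C. The FNV-1a hash below is a stand-in digest chosen only for brevity; any block hash would show the same effect:

  #include <stdio.h>
  #include <stdint.h>

  /* FNV-1a: a stand-in digest for this illustration only. */
  static uint64_t fnv1a(const char *p, int n)
  {
      uint64_t h = 1469598103934665603ULL;
      while (n--) {
          h ^= (unsigned char)*p++;
          h *= 1099511628211ULL;
      }
      return h;
  }

  int main(void)
  {
      const char *ref = "ABCDEFGHIJKLMNO";   /* Exhibit 1, Reference */
      const char *cur = "XBCDEFGHIJKLMNO";   /* Exhibit 1, Current   */
      int sizes[2] = { 5, 15 };

      for (int s = 0; s < 2; s++) {
          int n = sizes[s], matches = 0;
          /* Hash each aligned n-byte block of both files and compare. */
          for (int off = 0; off + n <= 15; off += n)
              if (fnv1a(ref + off, n) == fnv1a(cur + off, n))
                  matches++;
          printf("N-bytes = %2d: %d matching block(s)\n", n, matches);
      }
      return 0;
  }

With N-bytes of 5 the program reports two matching blocks (ten bytes of detectable redundancy); with N-bytes of 15 it reports none.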

XI.(3): Conclusion

For those familiar with the art, it should be obvious that one wants N-bytes to be as small as possible in order to detect as many Runs as possible.

As a matter of empirical evidence, in the Preferred Embodiment, to achieve the best deduplication performance in terms of eliminating duplicate data and keeping the size of the Extended Reference File as small as possible, the best value for N-bytes is in the range of 512 to 2048.

The Inventors speculate that the reason values lower than 512 reduce deduplication performance is that, as the value of N-bytes becomes smaller, the number of places in the Extended Reference File matching an n-byte block increases.

Since the Method must pick some Run, it picks the “wrong Run”. See Section [0267] XIII for a discussion of duplicate hash values.

To illustrate this, consider reducing N-bytes to a value of 1. In this case, the DEDs would almost certainly be pointing to an effectively random place in the Extended Reference File, and the fact of the 1-byte match would not signify a high probability of a meaningful Run at that location (a Run whose length is, say, more than 100 bytes), thus wasting the effort needed to seek to the location in question.

XII. RAM as a Limiting Factor: N-Bytes Versus Digest Data Structure Maximum Hash Count (DDSMHC)

Assuming that one is implementing the methods used prior to the Inventors' Method, the problem with making N-bytes as small as possible is that it increases the RAM needed to support the number of DEDs (i.e. hash table entries) one needs to store into the Digest Data Structure (DDS) (e.g. a hash table).

For instance, assume that:

the hashing algorithm does not generate the same hash code for two different byte arrays;

the hash codes have perfect distribution properties;

the Reference File is 1000 bytes in size.

Then, assuming that N-bytes is 100, we would need 10 DEDs inserted into the DDS.

On the other hand, if N-bytes is 10, we would need 100 DEDs inserted into the DDS.

As should be readily apparent from these examples, the smaller N-bytes is, the larger the DDSMHC has to be.

Because RAM is a precious commodity, it is not unusual for implementers of the current art to make N-bytes quite large: 8K is not unusual.

We cite from http://www.hifn.com/uploadedFiles/Products/Solution_Bundles/Data_De_Dupe/HifnWP-BitWackr-2.pdf:

2.3 Data Deduplication

Data deduplication is a technique that eliminates redundant blocks of data. In the deduplication process developed by Hifn, blocks of data are “fingerprinted” using the Secure Hash Algorithm (SHA-1), a trusted algorithm that is widely employed in security applications and communications protocols. Hifn implements the SHA-1 algorithm in specialized silicon that produces a 160-bit hash (also referred to as a “digest” or “fingerprint”) from fixed length data blocks with block sizes between 4 KB and 32 KB.

Thus those familiar with the art will readily see a major improvement to the current art in that our arbitrarily small probe—currently 512 bytes in the Preferred Embodiment of the Invention—will likely detect far more Runs than those who set N-bytes to a larger value.

XIII. Duplicate Hash Values

Given an n-byte-block for which a hash value has been computed, it is possible for the hash value to be a duplicate.

This can be the result of two possibilities.

First, two n-byte-blocks with different contents can hash to the same value. This is a consequence of the so-called “Counting Problem”, which shows that any hash function must generate collisions if the number of bits in the hash value is smaller than the number of bits in the input. By enumerating all the inputs and listing all the hash values, one can easily see this: eventually, there must be duplicates in this many-to-one mapping.

Second, and somewhat more problematic, it is possible for the same n-byte-block (by content) to appear in more than one position in either the Reference File or the Current File. We discuss this further in section [0307] XVII.

This is easily seen, for instance, if a user made a copy of a file larger in size than N-bytes and took a non-compressed Image Backup of the disk or disks on which the two identical files are stored. The file would appear in two places on the disk and thus one or more n-byte-blocks would be duplicated.

We presume that most synchronizers work as follows: when a duplicate n-byte-block is found, a reference to one of the copies is stored along with a length-in-bytes or, almost equivalently, a length-in-blocks.

It is, of course, possible for our Method to examine both positions in the Extended Reference File and determine which sum of Run Lengths produces the smallest sum of sizes of Output Packets.

XIV. Difference in Speed and Efficacy Between Rolling Hashes (Hash Codes with Poor Statistical Properties) and Non-Rolling Hashes (Better Statistical Properties)

As those familiar with the art understand, as far as hash tables and hash codes are concerned, a good hash code has certain statistical properties, the object of which is to take random input and spread that input uniformly through an integer range of values so that the probability of two inputs (e.g., n-byte-blocks) having the same output is minimized. One typical offshoot of this good-hashing property is the so-called avalanche effect, in which if a single bit of the input is changed then, on average, half the bits of the output are changed.

A non-Rolling Hash (e.g. MD5) is relatively good at achieving the goal of uniform distribution compared to any Rolling Hash with which the Inventors are familiar. Of course, the reason one might use a Rolling Hash instead of a non-Rolling Hash is the computational expense of non-Rolling Hashes.

Thus the method we teach uses the same technique that rsync and others do: where appropriate, a Rolling Hash is computed instead of a non-Rolling Hash, because it is guaranteed that if two blocks have different hash values—whether rolling or non-rolling—then the contents of the two blocks are different. As we teach, above, if the two hash values are the same then there is a good chance that the contents of the two blocks are the same, but the only way to guarantee this is to actually compare the two blocks, generally by having to fetch them from disk.

In the Preferred Embodiment of the Method, we use Rolling Hashes to allow us to rapidly scan for Runs in data that may be similar except for some number of inserted and/or deleted bytes. We discuss this further, below.

XV. The Logical Structure of the Extended Reference File

The term “Extended Reference File” is a term of convenience.

In the Preferred Embodiment of the invention, it is a single file known to the operating system. We teach that it need not be a single file, and there are good reasons why one may not want it to be a single file.

Consider for a moment a Reference File for which we are certain there is no Intrafile Duplication, so that it would be a waste of time to perform deduplication; or a Reference File that is read-only and is, thus, unmodifiable.

The Method we teach could easily place new output packets in a separate file. The question then becomes how to refer to a location in the Reference File or in the file with the new output packets.

As those familiar with the art would immediately see, there are a large number of design choices to solve this minor problem. We discuss two possibilities.

In the first possibility, the output packet contains two fields: (1) a file reference, and (2) a byte position in the file.

In the second possibility, the output packet assumes that the Extended Reference File is a single file, but there is a side table that allows the Method to translate an offset into (1) a file reference, and (2) a byte position in that file.

Let us explore the second possibility.

Let us say that the Reference File is 1,000,000,000 bytes in length and that the output file where output packets are to be stored is called the Output Packet File. If in the DED a position of 1000 is detected, then it is obvious that some data from the Reference File is to be read. If, on the other hand, the position in the DED is 1,000,001,000, then data is to be read from the Output Packet File.

As a matter of definition, the Output Packet File is part of the Extended Reference File.

As should be immediately apparent to those familiar with the art, this can be extended to any number of files. All that need be maintained are data structures equivalent to an ordered table of file references (e.g. file names) and file lengths.
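A minimal C sketch of such a side table follows; the file names, lengths, and function names are illustrative assumptions only:

  #include <stdint.h>

  /* Hypothetical side table: the files making up the ERF, in order. */
  struct ErfPart {
      const char *name;
      uint64_t    length;
  };

  static const struct ErfPart parts[] = {
      { "Reference.bin",     1000000000ULL },  /* the Reference File     */
      { "OutputPackets.bin",  500000000ULL },  /* the Output Packet File */
  };

  /* Translate a logical ERF offset into a (file index, local offset)
     pair. Returns 0 on success, -1 if the offset is past the end. */
  static int translate(uint64_t erf_offset, int *file_index, uint64_t *local)
  {
      int n = (int)(sizeof parts / sizeof parts[0]);
      for (int i = 0; i < n; i++) {
          if (erf_offset < parts[i].length) {
              *file_index = i;
              *local = erf_offset;
              return 0;
          }
          erf_offset -= parts[i].length;   /* skip past this file */
      }
      return -1;
  }

With the lengths above, an ERF position of 1000 resolves to offset 1000 in the Reference File, while a position of 1,000,001,000 resolves to offset 1000 in the Output Packet File, matching the example in the text.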

XVI. The Packet Structure of the Extended Reference File

To repeat, the two inputs are the Reference File and the Current File. The output is the Extended Reference File, which consists of an Old Reference File followed by zero or one Chapter Packets, followed by a collection of zero or more non-Chapter Output Packets, followed by a Chapter Packet. A Chapter Packet is a special packet that marks the end of a Chapter and is the last packet written before the Chapter is closed and the File is closed.

Using a standard C syntax, a simpleminded Output Packet might have the following format:

Exhibit 2

  struct IdenticalBlock
  {
      int BlockNumber;
  };

  struct DifferentBlock
  {
      int nBytes;
      unsigned char StringOfDifferentBytes[1];
  };

  union PacketUnion
  {
      struct IdenticalBlock identical;
      struct DifferentBlock different;
  };

  struct OutputPacket
  {
      char BlockIndicator;   /* A flag indicating what kind of Packet we have */
      union PacketUnion u;
  };

Those familiar with the art would immediately understand that IdenticalBlock is a fixed-length data structure that represents (i.e. points to) a block in the Reference File. If, for instance, an n-byte-block is 512 bytes in length, an int is eight bytes, and the BlockIndicator is one byte, then an OutputPacket containing an IdenticalBlock could represent those particular 512 bytes in nine bytes.

From the simpleminded data structure in Exhibit 2, we slightly modify IdenticalBlock.

Exhibit 3

  struct IdenticalBlock
  {
      int BlockNumber;
      int nBlocks;
  };

By adding int nBlocks, we can create a reference to a Run of contiguous matching data blocks in the Reference File and Current File.

Thus if there is a run of, say, one hundred blocks that match, this can be represented—using the data structure in Exhibit 3—by a single OutputPacket of seventeen bytes rather than one hundred OutputPackets, which would have a total length of nine hundred bytes.

Adding the field int nBlocks has consequences that are not immediately obvious. This will be discussed below.

In the Preferred Embodiment, the IdenticalBlock structure implementation looks more like this:

Exhibit 4

  struct IdenticalBlock
  {
      int StartingPositionInReferenceFile;
      int Length;
  };

That is, there is no necessity for the identical section to start on a block boundary. This allows partial block matching, which can improve the efficiency of the representation and increase the amount of data removed by deduplication.

Another optimization can be added for zero-filled areas, which apparently occur frequently in some common types of files. The data structure to represent a block of zeroes might look like this:

Exhibit 5

  struct ZeroBlock
  {
      int Length;
  };
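A scanner can recognize such areas cheaply. The following C fragment is an illustrative sketch of one way to measure a candidate zero run before deciding whether to emit a ZeroBlock (the function name and threshold handling are assumptions):

  /* Return the length of the run of zero bytes starting at pos.
     If the run is at least some minimum length, the caller can emit
     a ZeroBlock packet instead of carrying the zeros as literal data. */
  static int zero_run_length(const unsigned char *buf, int pos, int end)
  {
      int i = pos;
      while (i < end && buf[i] == 0)
          i++;
      return i - pos;
  }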

XVI.1 Structure of the Matching Data Packet (MDP)

For purposes of this specification, the OutputPacket where the BlockIndicator indicates that we have matching data is known as a Matching Data Packet (MDP).

XVI.2 Structure of the Literal Data Packet (LDP)

For purposes of this specification, the OutputPacket where the BlockIndicator indicates that we have non-matching (i.e. “literal”) data is known as a Literal Data Packet (LDP).

XVII. Duplicate User Embedded Data

Consider two large and nearly identical user files that happen to be slightly different at the end of the files. These files are actually or virtually stored in the Reference File. Further assume, for simplicity, that the two user files are defragmented.

The problem for all synchronizers that happen to use Hash Codes (and even for those that don't) is to determine which copy of the file (or n-byte block) to point to when a copy of the file (or n-byte block) is found in the Extended Reference File.

Presuming that there are two n-byte-blocks with the same content in the Extended Reference File and that the synchronizer goes through a phase in which n-byte-blocks are stored in a DDS, then there are at least four possible strategies that can be used to store the corresponding DEDs into the DDS. (A) Store the first occurrence and ignore any further occurrences. (B) Store the last occurrence and throw away previous occurrences. (C) Store all occurrences of the DED into the DDS and deal with the extra complications as well as the overhead. (D) Randomly add the DED by, say, generating a random number between 1 and 100 and then adding the DED if the number generated is above 80.

In the Preferred Embodiment, the first occurrence of the DED is stored and all newer ones are not stored. While this may not be optimal, it is generally “good enough”; a great deal of deduplication will be forthcoming regardless of which copy is selected, because many of the parts that are the same between and among the files will be picked up.
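A sketch of strategy (A), first occurrence wins, might look like the following C fragment; the slot-selection scheme, field names, and occupancy flag are illustrative assumptions:

  #include <stdint.h>

  struct DED {
      uint32_t LowerHashValue;
      uint32_t BlockNumberInReferenceFile;
      uint64_t GoodHashValue;
      uint8_t  occupied;
  };

  #define SLOTS (1u << 20)
  static struct DED dds[SLOTS];

  /* Strategy (A): keep the first occurrence, ignore later ones. */
  static void insert_first_wins(uint32_t rolling, uint64_t crc64,
                                uint32_t block_number)
  {
      struct DED *d = &dds[(rolling >> 8) % SLOTS];
      if (d->occupied)
          return;                    /* a DED is already here; keep it */
      d->LowerHashValue             = rolling;
      d->GoodHashValue              = crc64;
      d->BlockNumberInReferenceFile = block_number;
      d->occupied                   = 1;
  }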

XVIII. The Naive Method Possibly Used by Most Other Synchronizers (English)

The basic idea behind most synchronizers is to create a Digest Data Structure (DDS) that represents a collection of hash values as well as other data, such as the position where the block can be found in one or both files to be synchronized. This DDS contains a digest for each n-byte-block. The DDS is usually of the form of a hash table, and each digest is usually therefore the result of a hashing algorithm such as MD5 or CRC. In any event, a hashing algorithm generating a collection of digests stored in a DDS is used to shrink the size of the data to be manipulated; this is done because working with the raw data rather than a digest results in computation times that are so long as to be infeasible.

The two files are often called the “Reference File” and the “Current File”. Conceptually, the Reference File could be a backup image taken on Monday and the Current File a backup image taken on Tuesday.

In general, the Reference File is divided into n-byte-blocks and each block is represented by a hash value (CRC, MD5, SHA-1, etc.). Then the Current File is scanned (often byte-by-byte) and the same hashing algorithm is used to compute a hash value for any n-byte-block.

The assumption is generally made by most other synchronizers that two n-byte-blocks having the same hash value are, indeed, the same block. As Shnelvar (6374266) and many others (e.g. http://infohost.nmt.edu/.about.val/review/hash.pdf, which is incorporated herein by reference) have pointed out, this may be an invalid and highly dangerous assumption; nonetheless, this is the assumption that is made.

The Inventors are anecdotally aware of the following: the well-known rsync program has an extra check to reduce the probability that two n-byte blocks having the same hash value (i.e. digest value) are wrongly treated as the same block, using a technique roughly equivalent to the following:

a. rsync computes a Good Hash on the whole “newer” file and transmits that along with the data.

b. If this whole-file hash doesn't match on the receiver, then rsync does the whole operation again with a differently-seeded hashing algorithm.

c. Repeat the above until the whole-file hash does match.

Of course, this still isn't a guarantee of success, but it isn't as likely to fail as the naive approach described above.

That is, rsync computes a whole-file Good Hash and transmits it to the receiver. rsync then calculates the Good Hash of the resulting file on the receiver. If they don't match, they redo the whole transmission with a slightly different hashing algorithm that should not have the same erroneous “equivalent block” problem.

Returning now to the Method that we teach: if a hash match is found, all that need be indicated is that the n-byte-block of data can be found at a particular offset in the Reference File. The algorithm continues until the Current File becomes exhausted.

Often, the list of offsets and unassociated data (i.e. data found in the Current File but not in the Reference File) is transmitted to a second computer so that the second computer can update its copy of the Reference File in order to create a copy of the Current File. This list of offsets and unassociated data is often called a Patch File.

Of course, the unassociated data could be passed to a zlib-like compressor to lower the overhead of transmitting many bytes. As those familiar with the art know, if the data to be compressed has already been compressed (e.g. the data is part of a jpeg or a zip file) then it is likely that the “compression” will produce results larger than the already-compressed data.

Central and implicit to the method(s) described above is that the Digest Data Structure (DDS) fits in RAM. The DDS, usually of the form of a hash table, needs to fit in RAM because access to an element in the hash table needs to be quick in order to produce results competitive with other implementations. As those familiar with the art understand, and as we taught in Section [0234] X, the difference in speed between accessing an element of a hash table in RAM versus a random access seek-and-read on a disk drive can be in excess of a million to one. Access across a network is often even slower. Therefore, placing the DDS (e.g. a hash table) on a relatively slow non-RAM device makes the computer program almost certainly impractical and uncompetitive.

Allowing the DDS to grow so that the DDS is placed in virtual memory merely shifts the problem of disk-based seeking from the application program to the operating system. There would be little if any gain in speed if the DDS were not part of the Working Set.

The Inventors wish to explicate the previous paragraph. What should be well known to those familiar with the art is that hash codes have the desired property that, given an input block for which a hash code is computed, the value of the hash code is, for all intents and purposes, random in the range for which the hash values are computed. This desirable property spreads the hash codes evenly through a hash table, thus reducing so-called hash table collisions. The undesirable side effect of this property is that if a hash table is placed in virtual memory, there would be much page file thrashing as the operating system constantly brings in pages of memory.

We now return to an explanation of how synchronizers work.

Consider a DDS which is a hash table using MD5 as the hashing algorithm; that is, the digests stored in the table are MD5 values. Each entry in the hash table is at least 128 bits long because an MD5 hash is 128 bits long; and each entry is generally longer in order to capture information such as block position in the Reference File. For simplicity, assume that each hash table entry is 128 bits (16 bytes) long.

Most modern commodity computers (circa 2009) contain on the close order of a gigabyte of RAM. If this amount of RAM were completely committed to a hash table (thus leaving no room for the operating system and operational programs), then the hash table could represent at most 6.2×10⁷ (==10⁹/16) blocks. If each n-byte-block were 512 bytes, then the absolute maximum size representable by such a hash table would be 32 gigabytes.

32 gigabytes would be considered to be a large file by most modern home PC users. Note, though, that 32 gigabytes would roughly represent the size of an image backup for a common 60-gigabyte disk that was relatively full, given a 50% compression ratio.

Note that this 32-gigabyte limit is an absolute maximum for a computer with a gigabyte of RAM. The practical maximum is far below this for the following reasons.

First, if the hash algorithm used is SHA-1 instead of MD5, then each entry will be 20 bytes in length rather than 16. A hash table consisting of only SHA-1 entries could represent only 25.6 gigabytes rather than 32 gigabytes.

Second, as noted above, each hash table entry will be larger than just the (CRC, SHA-1, MD5, etc.) digest. In the present Invention's Preferred Embodiment, each hash table entry is sixteen bytes long. Each entry in the hash table is a DED, which contains the following:

Exhibit 6

  struct DED
  {
      unsigned  LowerHashValue;
      unsigned  BlockNumberInReferenceFile;
      long long GoodHashValue;
  };

where “LowerHashValue” is the low 32 bits of the calculated “fast” Rolling Hash value (to be described, below); “BlockNumberInReferenceFile” is the block number in the Extended Reference File to which the hash table entry pertains; and “GoodHashValue” is the CRC-64 (“good”) hash value for that block. However, if a larger hash value, such as MD5 or SHA-1, were used, then clearly this data structure would be larger than 16 bytes.
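The way such an entry is used can be sketched in a few lines of C. The table size, indexing scheme, and helper crc64() are illustrative assumptions; the two-level check (cheap Rolling Hash comparison first, CRC-64 confirmation second) is the point:

  #include <stdint.h>

  struct DED {
      uint32_t LowerHashValue;             /* low 32 bits of Rolling Hash */
      uint32_t BlockNumberInReferenceFile;
      uint64_t GoodHashValue;              /* CRC-64 of the block         */
  };

  #define SLOTS (1u << 22)
  extern struct DED dds[SLOTS];
  extern uint64_t crc64(const unsigned char *p, int n);

  /* Return the candidate block number, or -1 if there is no plausible
     match. Note that even a CRC-64 match is only a candidate: the Method
     still compares the underlying data before trusting the match. */
  static int64_t lookup(uint32_t rolling, const unsigned char *block, int n)
  {
      const struct DED *d = &dds[rolling % SLOTS];
      if (d->LowerHashValue != rolling)
          return -1;                        /* cheap rejection            */
      if (d->GoodHashValue != crc64(block, n))
          return -1;                        /* CRC-64 confirmation failed */
      return d->BlockNumberInReferenceFile; /* worth the cost of a seek   */
  }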

Third, as noted above, RAM is used for items other than the hash table. In order for most programs to work efficiently, RAM must be allocated for things such as file buffers, frequently executed operating system code, and other data structures.

Fourth, as those familiar with the art understand, hash table collisions due to clustering render a hash table less and less efficient as the hash table becomes populated. As is readily apparent to those familiar with the art, as the Subscription Ratio approaches 100%, the odds of two hash values mapping into the same hash bucket approach 100% even for the best hash functions. As the table becomes more full, more and more sophisticated algorithms (which, of course, become computationally more expensive) are required to handle hash bucket collisions. As we report below in Section [0339] XX, one vendor recommends that the Subscription Ratio (SR) be approximately 0.30. In the Inventors' method, we achieve very good deduplication performance with SRs exceeding 30; a factor of 100 difference.

Modern rsync-like programs seem to have a practical upper bound of much less than a gigabyte. The Preferred Embodiment of this invention has a practical upper bound of at least hundreds of gigabytes.

XIX. The Naive Method Probably Used by Most Other Synchronizers (Pseudocode)

Those familiar with the art will be able to understand the C-like pseudocode syntax, below.

Exhibit 7: Pseudocode in which edge conditions have been ignored

  // NbytesPerBlock: Number of bytes that the algorithm uses to
  // "chunk up" the Reference File and Current File.
  // Typical value is 512.

  //
  // Build DDS for Reference File
  //
  While there is data left to read in the Reference File;
      Read NbytesPerBlock into a buffer;
      Compute hash value of buffer;
      Insert hash value and block number into DDS;

  // NbytesPerCurrentFileBlock: Number of bytes that the algorithm
  // uses to "chunk up" the Current File so that the Current File can
  // be further scanned byte-by-byte in order to locate blocks of data
  // to be found in the Reference File.
  // The Preferred Embodiment uses a tuneable value of 256K bytes.
  // Almost any value larger than NbytesPerBlock can be used.
  // We would expect larger values of NbytesPerCurrentFileBlock to
  // increase performance.

  //
  // Read Current File and compare to Reference File
  // Write to Extended Reference File (ERF)
  //
  While there is data left to read in the Current File;
      Read NbytesPerCurrentFileBlock into BufferA;
      PtrBufferA = start of BufferA;
      While there is data left to process in BufferA;
          Peel off NbytesPerBlock bytes at PtrBufferA and place into BufferB;
          HashOfBufferB = Compute hash of BufferB;
          if HashOfBufferB exists in DDS
          {
              If there is any data in the Temporary Literal Buffer (TLB)
              {
                  Emit a Literal Data Packet (LDP) based on
                      the TLB to the Extended Reference File;
                  Clear the TLB;
              }
              Create a Matching Data Packet (MDP);
              Emit the MDP to the ERF;
              PtrBufferA += NbytesPerBlock;
          }
          else
          {
              Store byte at PtrBufferA into the TLB;
              PtrBufferA++;
          }

XX. Novelty of the Method with Respect to Oversubscription of the DDS

The following non-obvious insights are central to the novelty of the current invention.

First, and foremost, is the insight that the DDS can be oversubscribed. That is, it is not necessary for the DDS to contain every hash for every n-byte-block. Indeed, a Subscription Ratio of 30 is fairly typical as implemented in the Preferred Embodiment.

To demonstrate the extreme novelty of this portion of the Invention, we quote from www.hifn.com/uploadedFiles/Products/Solution_Bundles/Data_De_Dupe/HifnWP-BitWackr-2.pdf (which is incorporated herein by reference):

“-s, --hash_size N

Set the hash table size in N number of bits. The number must be between 16 and 32. The hash table will contain 2^N entries, e.g., N=16 will set a hash table with 65536 entries.

Each entry is a “hash bucket” which can contain a Block Address that points to a Dedupe Data Block. However, due to the hashing algorithm only a small percent of the entries can be filled before truncated hash duplicates occur, which will greatly affect the performance of the system. Generally one would want to select a hash size value that is large enough so that the volume size in blocks is less than 30% of the total hash table entries.”

What the above tells us is that a manufacturer of hardware to do data deduplication uses the “standard” method of undersubscribing the hash table and suggests that the users specify a hash table size such that the DEDs only consume 30% of the hash table (i.e. the DDS).

In the Preferred Embodiment of this invention, the DDS, like the one used by hifn (quoted immediately above), is a hash table.

The second non-obvious insight is that heuristic techniques can be used to find “runs of common blocks” so that the duplication of data in these common blocks can be eliminated and a reference-and-a-length can be substituted for the data. Substituting a reference and a length in order to eliminate duplication is well known to those familiar with the art.

The basic insight regarding the heuristic used in the Preferred Embodiment is that matching hash codes can be used to give clues as to where identical runs of matching data might be found. Various techniques will be taught (see section [0443] XXXI, Using Heuristics to Determine where Runs of Identical Data can be Found).

Third, where the Repository is a device in which seek time is computationally expensive, this method uses the hash values to determine if the expense of seeking should be incurred, rather than assuming that two hash values that are identical point to identical data. This allows the Method to use computationally inexpensive hashes (e.g. Rolling Hashes) with poor theoretical properties with respect to collisions.

XXI. The Inventors' Method is Able to Find Runs of Data at Positions Other than Those Divisible by N-Bytes

We now answer the question posed in [0101] V.(11) Summary—What about insertions and deletions?

We now teach how the Inventors' Method is able to detect Runs that do not start on positions in the Extended Reference File divisible by N-bytes.

For purposes of this explanation, assume that the cost of computing a CRC64 for an n-byte-block (e.g. N-bytes of data) is zero. Further, assume that we read in a B-byte-block from the Current File (in the Preferred Embodiment, 256K in size).

Then for each byte position for which we can compute a CRC64 (that is, 256K−N-bytes+1 positions), we compute a CRC64. We then search the DDS (that is, the hash table) and check if the hash code computed for that byte position matches one in the DDS.

If there is a match on the CRC64s, then we proceed to read the corresponding n-byte-blocks and check for a Run. If a Run is detected, then we perform an Expanding-The-Run operation.
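In outline, the simplified scan just described might look like this in C; crc64() and dds_find() are hypothetical helpers standing in for the digest computation and the DDS probe:

  #include <stdint.h>

  extern uint64_t crc64(const unsigned char *p, int n);
  extern long dds_find(uint64_t digest);   /* returns ERF position or -1 */

  /* Probe the DDS at every byte offset of a B-byte buffer, using an
     n-byte-block of size N (e.g. 512) as the window. */
  static void scan_buffer(const unsigned char *buf, int blen, int N)
  {
      for (int off = 0; off + N <= blen; off++) {
          uint64_t h = crc64(buf + off, N);
          long ref_pos = dds_find(h);
          if (ref_pos >= 0) {
              /* Fetch the corresponding n-byte-blocks, verify that the
                 data actually matches, and if a Run is confirmed,
                 perform Expanding-The-Run (see the next section). */
          }
      }
  }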

XXI.1 Expanding-the-Run Explained

Assume that N-bytes is 512 bytes and that, in the process of Expanding-The-Run, 100 bytes of identical data are found prior to the n-byte-block and 200 bytes of identical data are detected following the n-byte-block. We then create and emit a Matching Data Packet (MDP) recording the fact that 812 (100+512+200) bytes of common data were found. We can then advance the pointer at which we calculate CRC64s by 200 bytes.
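A sketch of the expansion itself, in C, with the Reference and Current data presented as in-memory buffers for simplicity (a real implementation would be reading from the ERF and the Current File):

  /* Expand a confirmed n-byte match in both directions. Growth toward
     lower addresses stops at the Current File Wall (cur_wall); growth
     toward higher addresses stops at the end of either buffer. */
  static void expand_run(const unsigned char *ref, long ref_len, long ref_pos,
                         const unsigned char *cur, long cur_len, long cur_pos,
                         long n, long cur_wall,
                         long *run_start, long *run_len)
  {
      long lo = 0, hi = 0;

      /* Low-order side: stop at the Current File Wall. */
      while (cur_pos - lo > cur_wall && ref_pos - lo > 0 &&
             cur[cur_pos - lo - 1] == ref[ref_pos - lo - 1])
          lo++;

      /* High-order side: stop at EOF. */
      while (cur_pos + n + hi < cur_len && ref_pos + n + hi < ref_len &&
             cur[cur_pos + n + hi] == ref[ref_pos + n + hi])
          hi++;

      *run_start = cur_pos - lo;     /* where the Expanded Run begins */
      *run_len   = lo + n + hi;      /* e.g. 100 + 512 + 200 = 812    */
  }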

When we emit the MDP, we also advance the Current File Wall (CFW).

It is a novel feature of the Inventors' Method that the Method can and will compute hash codes in the Current File for, possibly, many positions other than positions divisible by N-bytes. When and if one or more of these hash codes are found to match entries in the DDS, each such matching DED provides a block position in the ERF which is then used as a starting point by an Expanding-The-Run operation. This allows the Method to pick up Runs at positions other than those divisible by N-bytes.

Since on a standard IBM PC-type of computer there is a significant cost to computing CRC64s, the Preferred Embodiment does things somewhat differently.

The following is a simplification of what is actually done in the Preferred Embodiment; but what we teach here is, except for some optimizations, logically equivalent to the Preferred Embodiment.

In the Preferred Embodiment, the DDS is a hash table whose entries are indexed by a subset of the bits of the value of a Rolling Hash. The hash table entry contains fields for:

an n-byte-block location;

the value of the Rolling Hash for the n-byte-block location;

a CRC64 for the n-byte-block.

As we have taught, above, CRC64s (and MD5s and SHA-1s, etc.) are computationally expensive. Relatively speaking, Rolling Hashes are very cheap to compute when one is computing a collection of them where the collection represents overlapping n-byte-blocks in which the blocks are separated by, say, one byte.

In the Inventors' Preferred Embodiment, most of the computation of a Rolling Hash can be accomplished with three additions, two subtractions, and two shift operations. This operation is computationally very inexpensive.
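The specification does not spell out the exact hash here, but a classical rolling checksum of the rsync family has a comparable operation count when N is a power of two (so that the multiplication by N becomes a shift). The following C sketch shows that scheme; it is an illustration, not necessarily the hash used in the Preferred Embodiment:

  #include <stdint.h>

  #define N     512   /* n-byte-block size; a power of two */
  #define LOG2N   9

  /* Slide the window one byte: the byte `out` leaves, `in` enters.
     a is the running sum of the window's bytes; b is the running
     sum of the successive a values. */
  static inline void roll(uint32_t *a, uint32_t *b,
                          unsigned char out, unsigned char in)
  {
      *a += in;
      *a -= out;
      *b += *a;
      *b -= (uint32_t)out << LOG2N;   /* N * out, done as a shift */
  }

  /* Combine the two running sums into a single 32-bit hash value. */
  static inline uint32_t rolling_hash(uint32_t a, uint32_t b)
  {
      return (b << 16) | (a & 0xFFFF);
  }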

In the Preferred Embodiment, the following operations are performed:

The DDS is keyed off the Rolling Hash. If a match is found on the Rolling Hash value, then for the n-byte-block in the Current File a CRC64 is computed. If the CRC64 computed for the n-byte-block in the Current File matches the CRC64 computed for the n-byte-block in the Extended Reference File, then an Expanding-The-Run operation is performed.

The purpose of the extra complication of computing a CRC64, rather than depending exclusively on Rolling Hashes, is that, on computer hardware circa 2009, the Inventors have found (as we have taught in Section [0234] X, Seek and Ye Shall Find it Slowly) that it is computationally faster to compute a CRC64 than it is to access the data on a disk drive, and the reduction of “false positives” from Rolling Hash matches more than makes up for the additional computation time for calculating the CRC64.

XXII. Improvements Over Prior Art

(1) Modern rsync-like programs seem to have a practical upper bound of much less than a gigabyte. The Preferred Embodiment of this invention has a practical upper bound of at least hundreds of gigabytes. The reason for this is that methods like rsync or 6374266 require sufficient RAM so as to have room for the equivalent of a DED for each block of data in the Reference File. As those familiar with the art can see, one can shrink the needed size of the DDS by increasing the size of the block corresponding to a DED; but then the probability of finding matching data is reduced and thus redundancy elimination and compression are reduced in efficacy.

(2) rsync-like programs assume that two “good hash codes” that happen to be equal point to the same content. Our Method does not do that but, instead, uses the hash codes to determine if a seek that is computationally expensive in the time domain should be performed. Thus our Method is guaranteed to produce correct output while rsync-like programs are only likely to produce correct output.

XXIII. Reversible Digests

We mean to use the phrase “Reversible Digest” in one of three senses.

(1) An algorithm such as “Generate an array of 1500 bytes of zeros.” In this case all that needs to be noted is a flag for the algorithm (generate an array of zeros) and the number of zeros.

(2) An algorithm, such as LZW, that might compress an array of bytes and for which decompression recreates the array of bytes exactly.

(3) A “trivial” transformation of an array of bytes, such as reversing the order of the first two bytes of the array.

The Inventors recognize that the Digests stored in table entries of the DDS need not be hash codes similar to CRC64s or SHA-1s; they could be, for instance, the underlying data itself.

We explain the paragraph immediately above further, by example.

Assume that the DDS is organized in such a manner that table entries are fetchable by key and the fields of the DED comprise:

a 64-byte key;

a byte offset in the Extended Reference File.

Instead of matching on a hash code, an alternative embodiment of the Method could use 64 (or whatever number is felt to be appropriate by User Option) bytes of the original data to probe the Current and Reference Files.

In another embodiment, it might be found to be efficacious to use every 16th byte of the n-byte-block as the reversible digest and to key off of this value to find matching Runs.

Again, the novelty of the Invention is that the DDS be oversubscribed; the form of the Digest is almost irrelevant.

Thus, when we use the phrase “Digest” in the claims, we mean both a hash code and a Reversible Digest.

XXIV. Chapter 0 in the Extended Reference File is Special in the Preferred Embodiment

In the Preferred Embodiment, Chapter 0 contains no information relating to user data, but is used only as a “dummy” chapter to which “real” chapters are later added. However, as a User Option, it would be possible to store the actual data of Chapter 0 as “raw” data.

Indeed, in previous incarnations of the Preferred Embodiment, all that was done to create a Chapter 0 was that a Chapter Packet was appended to the raw user data. This was abandoned in the Preferred Embodiment because (1) it modified a user file and (2) it performed no useful intrafile deduplication of a Reference File.

It would be a rather trivial operation to modify the Chapter Packet to reference files (see Section [0402] XXVII for an example of how this could be done).

This would be desirable in the not infrequent case that the user knows that the data associated with the first Chapter of the Extended Reference File (ERF) contains little that could be deduplicated. Thus, attempting to do Intrafile Deduplication on this first Chapter is a waste of computer time.

In this case, the Method could create an ERF by appending a Chapter 0 packet to the “raw” data. A DDS is then built from this ERF.

XXV. Adding the Chapter Packet to the Extended Reference File Assuming the Extended Reference File is on a WORM Device

We extend the discussion we started in Section [0280] XV.

It is not an essential feature of the Method that the Extended Reference File be on a WORM storage device, but the Method readily accommodates WORM storage devices. Once the final Decorated Packet representing the final bytes of the Current File has been processed, a special Decorated Packet, the Chapter Packet, is emitted. The Chapter Packet may look something like this:

Exhibit 8: Possible structure of the fields of a Chapter Packet

  Chapter Flag
  Position in Extended Reference File of the first Decorated Packet emitted for the Chapter
  Chapter Number
  Magic Number
  Human Readable Description of Chapter
  Spare Bits Reserved for Future Use

None of the fields following the “Position in Extended Reference File of the first Decorated Packet emitted for the Chapter” field are strictly necessary. The fields starting at the Chapter Number are useful to establish that the Chapter has been closed successfully, as well as providing human-readable information about the chapter.

The Chapter Number is not necessary, since the Chapter Number can be computed by marching backwards through the chain of Chapter Packets and counting the number of Chapter Packets until a special “no preceding Chapter Packet” flag is detected.

Having a Chapter Number in the Chapter Packet is useful so that the Method can quickly determine the number of Chapters in the Extended Reference File.

The Magic Number is not necessary assuming that the Method as implemented in software and hardware has operated correctly. The Magic Number is there in case, for instance, the hardware has failed. The Method is expecting a Chapter Packet as the last bytes of the Extended Reference File. If the value in the Magic Number field is incorrect, then the Extended Reference File (ERF) is corrupted.

The “Human Readable Description of Chapter” field only serves the purpose of identifying the Chapter for humans. It is not necessary for the successful operation of the Method as implemented in software or hardware.

As anyone familiar with the art understands, the “Spare Bits Reserved for Future Use” field is not necessary for the successful operation of the Method as implemented in software or hardware.

Even though the Chapter Packet points to the “Position in Extended Reference File of the first Decorated Packet emitted for the Chapter”, finding the beginning of the previous Chapter Packet is easy since the Chapter Packet is of fixed size. One merely subtracts the size of the Chapter Packet from the “Position in Extended Reference File of the first Decorated Packet emitted for the Chapter” in order to find the previous Chapter Packet in the chain of Chapter Packets.
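In C, the backward step might look like the following sketch; the packet layout is an illustrative rendering of Exhibit 8, not the exact on-disk format:

  #include <stdint.h>

  /* An illustrative, fixed-size rendering of the Exhibit 8 fields. */
  struct ChapterPacket {
      uint8_t  chapter_flags;        /* includes "no preceding Chapter
                                        Packet" for the first Chapter    */
      uint64_t first_packet_pos;     /* position in the ERF of the
                                        Chapter's first Decorated Packet */
      uint32_t chapter_number;
      uint64_t magic;
      char     description[64];      /* human readable */
  };

  /* The previous Chapter Packet immediately precedes this Chapter's
     first Decorated Packet, and Chapter Packets are of fixed size. */
  static uint64_t prev_chapter_packet_pos(const struct ChapterPacket *cp)
  {
      return cp->first_packet_pos - sizeof(struct ChapterPacket);
  }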

XXVI. Adding the Chapter Packet to the Extended Reference File Assuming that the Extended Reference File is not on a WORM Device; Using a Chapter Table of Contents

As implemented in the Preferred Embodiment, the packets of data described in Sections [0292] XVI and [0386] XXV can easily be stored on a WORM storage device because only new data is appended to the file.

Assuming that the implementer of the Method chooses to store the data on a rewriteable device, then it would be efficacious to create a Chapter Table that would list the starting position of each Chapter.

One of many ways to implement a Chapter Table of Contents would be to store it as an array of elements such as the following:

Exhibit 9: Possible structure of an element of the Chapter Table of Contents array

  Position in Extended Reference File of the first Decorated Packet emitted for the Chapter
  Human Readable Description of Chapter
  Spare Bits Reserved for Future Use

As those who are familiar with the art are aware, this array of Chapter structures can be implemented in any number of ways. The count of Chapters could be stored separately and the array could be stored in a file on disk.

In the alternative, the implementer, or the user via a User Option, could specify that the Chapter Table of Contents array have a maximum size and be stored at the beginning of an Extended Reference File.

XXVII. Structure of an Extended Reference File with Multiple Chapters Stored on a WORM Storage Device

Even though this section indicates that we are discussing ERFs stored on WORM drives, those familiar with the art understand that the techniques taught here work equally well on non-WORM drives or on devices that enforce a Write Fence.

As should be clear, the process of processing Current Files and adding output packets to an Extended Reference File can be repeated as many times as is useful. The Method allows each Current File processed to be reconstituted separately. We call the Original Reference File “The First Chapter” and the portion of the Extended Reference File containing the additional data needed to reconstruct the first Current File added to the Reference File the “Second Chapter.”

Note that the First Reference File may have zero length.

For the purposes of clarity in the specification, above, we did not show the Chapter Packet being inserted after the first Reference File. In the Preferred Embodiment, a Chapter Packet is emitted to the Extended Reference File.

The logical structure of the Extended Reference File with five Chapters in the Preferred Embodiment looks something like this:

Exhibit 10

  (1) First Reference File (may have zero length)
  (2) Chapter Packet of the First Chapter
  (3) First Decorated Packet of the Second Chapter
  (4) Decorated Packets representing the Second Chapter
  (5) Chapter Packet of the Second Chapter
  (6) First Decorated Packet of the Third Chapter
  (7) Chapter Packet of the Third Chapter
  (8) First Decorated Packet of the Fourth Chapter
  (9) Chapter Packet of the Fourth Chapter
  (10) First Decorated Packet of the Fifth Chapter
  (11) Chapter Packet of the Fifth Chapter

As we taught in Section [0280] XV, these five Chapters could be split into any number of files. Of course, having these five Chapters split into five pieces—such that each piece is a Chapter—is one obvious partitioning of the Chapters.

Assume that we wish, indeed, to partition the ERF into five separate files. The Method and the Preferred Embodiment could both be extended as follows.

A metafile which indicates the names of the five files in the correct order is created. Let us call the file Metafile.txt. Using Microsoft Windows filename syntax, the content of Metafile.txt might be something like that of Exhibit 11.

Exhibit 11 (Lines starting with semicolons are comments. “cta” is the preferred filename extension for chapters in ERFs.)

  ; chapter 0 on a local read-only device
  ; Note that the structure of Y:\First.cta could, in turn,
  ; contain a field that contains a file name that refers to the
  ; actual file of undeduplicated data.
  Y:\First.cta
  ; Contains output packets representing literal and reference data.
  ; This is the third Chapter: Chapter 2
  \\RemoteComputer01/c/June/third.cta
  ; Stored on a local drive
  ; This is the fourth Chapter: Chapter 3
  c:\June\fourth.cta
  ; Stored on a local drive
  ; This is the fifth Chapter: Chapter 4
  c:\June\fifth.cta

Each “*.cta” file ends with a Chapter Packet.

It has been the experience of the Inventors that validating the internal consistency of such a metafile, and making sure that all the files are both internally and externally consistent, is a daunting but achievable task.

As those who are familiar with the art understand, implementing such metafile handling adds considerable flexibility and power to the Method, allowing the implementers to create delta files so as to allow the synchronization of huge files by merely shipping, say, c:\June\fifth.cta across a network.

Extending the discussion of Section [0380] XXIV, using the techniques taught in this section, it would become a trivial operation to have Chapter 0 either refer to no user data or point at a user file that is on, say, a WORM device.

XXVIII. Reconstituting a Chapter from an Extended Reference File

In the Preferred Embodiment, a Chapter Packet is appended to the Extended Reference File (ERF) as the last Decorated Packet after a Current File has been completely processed. In the example below, we teach how to reconstitute a Chapter from a successfully created ERF.

Consider the ERF that we taught in [0402] XXVII. Assume that we wish to reconstitute the Third Chapter. That is, we are attempting to reconstitute the second file "added" to the base Reference File.

Referring to Exhibit 10 and Section [0386] XXV and using C-style pseudocode, we would seek to

    EOE - sizeof(Chapter Packet)

where EOE denotes the end of the Extended Reference File; that is, we seek to the point sizeof(Chapter Packet) bytes before the end of the file.

We would then read the Chapter packet representing

(11) Chapter Packet of the Fifth Chapter

We now are able to compute the position of

(9) Chapter Packet of the Fourth Chapter

since the Chapter Packet of the Fifth Chapter has a reference to the first packet of the Fifth Chapter, and, as explained in Section [0386] XXV, this has a fixed offset from the previous Chapter Packet. We repeat this process in the obvious manner to read

(7) Chapter Packet of the Third Chapter

Again, from "(7) Chapter Packet of the Third Chapter" we are now able to get

(6) First Decorated Packet of Third Chapter

because the field [0421] "Position in Extended Reference File of the first Decorated Packet emitted for the Chapter" is directly accessible from "(7) Chapter Packet of the Third Chapter".

We seek to "(6) First Decorated Packet of Third Chapter" and read the flag to determine what kind of Decorated Packet we have. In the Preferred Embodiment we have several types of flags, but for the purposes of this section of the specification we teach that there are only three types of Decorated Packet Flags:

(1) Literal Data Packet

(2) Match Data Packet

(3) Chapter Packet

The Method reads packets one at a time.

If the packet type is a Literal Data Packet then the "literal bytes" are emitted to the Reconstituted File.

If the packet type is a Match Data Packet, then we seek to the position in the Extended Reference File indicated by the packet, and copy the number of bytes indicated in the Length field of the Match Data Packet to the Reconstituted File.

If the packet type is a Chapter Packet then we are done and the Chapter has been reconstituted.
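Purely for illustration, the reconstitution loop just described might be sketched in C as follows. The packet layout assumed here (a one-byte flag followed by 64-bit Length and Position fields) is hypothetical; the actual Decorated Packet format of the Preferred Embodiment is described elsewhere in this specification.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical Decorated Packet flags, as taught in this section. */
    enum { LITERAL_DATA_PACKET = 1, MATCH_DATA_PACKET = 2, CHAPTER_PACKET = 3 };

    static void copy_bytes(FILE *src, FILE *dst, uint64_t n)
    {
        char buf[65536];
        while (n > 0) {
            size_t chunk = n < sizeof buf ? (size_t)n : sizeof buf;
            size_t got = fread(buf, 1, chunk, src);
            if (got == 0) break;                  /* truncated input */
            fwrite(buf, 1, got, dst);
            n -= got;
        }
    }

    /* Reconstitute one Chapter: 'first' is the offset in the ERF of the
       Chapter's first Decorated Packet, obtained from its Chapter Packet. */
    void reconstitute_chapter(FILE *erf, long first, FILE *out)
    {
        fseek(erf, first, SEEK_SET);
        for (;;) {
            int flag = fgetc(erf);
            if (flag == EOF || flag == CHAPTER_PACKET)
                return;                           /* Chapter Packet: done */
            uint64_t length, position;
            fread(&length, sizeof length, 1, erf);
            if (flag == LITERAL_DATA_PACKET) {
                copy_bytes(erf, out, length);     /* literal bytes follow in place */
            } else {                              /* MATCH_DATA_PACKET */
                fread(&position, sizeof position, 1, erf);
                long resume = ftell(erf);         /* where the next packet starts */
                fseek(erf, (long)position, SEEK_SET);
                copy_bytes(erf, out, length);     /* copy the matched run from the ERF */
                fseek(erf, resume, SEEK_SET);
            }
        }
    }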

XXIX. The DDS can be Rebuilt

As we see from Section [0415] XXVIII, the DDS is never referred to while reconstituting Chapters of original data.

It should thus be obvious that while the DDS contains valuable data about the Extended Reference File (ERF), a new DDS can be built given only the ERF.

Thus the DDS can be deleted; the only cost is that of rebuilding it.
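For illustration only, such a rebuild might be sketched as below, reusing the hypothetical packet layout and flag constants of the previous sketch. The types and helpers (Packet, Dds, next_packet, dds_clear, dds_insert, good_hash) are assumptions of the sketch, not names used by the Method.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical view of one parsed Decorated Packet. */
    typedef struct {
        int            flag;        /* literal, match, or chapter */
        uint64_t       length;      /* payload length in bytes */
        const uint8_t *data;        /* literal payload, if any */
        uint64_t       erf_offset;  /* where the payload lives in the ERF */
    } Packet;

    typedef struct Dds Dds;                          /* opaque DDS handle */
    bool     next_packet(FILE *erf, Packet *p);      /* walk the ERF packet by packet */
    void     dds_clear(Dds *dds);
    void     dds_insert(Dds *dds, uint64_t hash, uint64_t erf_offset);
    uint64_t good_hash(const uint8_t *data, size_t n);

    /* Rebuild the DDS given only the ERF: only Literal Data Packets
       physically contain data, so only their n-byte blocks are hashed. */
    void rebuild_dds(FILE *erf, Dds *dds, size_t n)
    {
        Packet p;
        dds_clear(dds);
        while (next_packet(erf, &p)) {
            if (p.flag != 1 /* LITERAL_DATA_PACKET */)
                continue;
            for (uint64_t off = 0; off + n <= p.length; off += n)
                dds_insert(dds,
                           good_hash(p.data + off, n), /* digest of one n-byte block */
                           p.erf_offset + off);        /* its offset in the ERF */
        }
    }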

XXX. Brief Conceptual Overview of the Inventors' Method

For illustrative purposes let us assume that an n-byte block is 256 bytes long and that a B-byte-block is 256K (262,144 bytes) and, further, that the Current and Reference Files are identical and of size one megabyte (1,048,576 bytes).

If the DDS is a mere four elements in size (a ridiculously small size, but we are attempting to illustrate the method), and, further, if the offsets selected for the four DEDs are as listed in Exhibit 12, then the Method would proceed as follows:

Phase 1: A Reference File Analysis Phase (RFAP) is performed in which the DDS is built. The Reference File is read and a four-element array DDS is built. For purposes of this discussion assume that the four DEDs actually inserted into the DDS for four n-byte blocks are roughly in the middle of each B-byte-block. We present Exhibit 12, which is the DDS sorted in order of offset in the Reference File.

Exhibit 12

    Hash table entry    Offset in Reference File    Hash Value
    2                   128K                        0x1232
    1                   384K                        0x7731
    3                   640K                        0x9AB3
    0                   896K                        0xCAB0

As those familiar with the art understand, the hash values are, essentially, random values associated with the corresponding n-byte block found at the "Offset in Reference File". Those familiar with the art also understand that the order of insertion of hash values into the hash table is also, essentially, random. For this hash table assume that the low-order two bits of the hash value are used as an index into the hash table and that there were a vast (4K-4) number of collisions, leaving only these four hash codes. Or, in the alternative, assume that we divided the total Reference File length by 4, executed a seek to each position indicated in the table in Exhibit 12, computed the hashes, and that there were no hash collisions.
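In code, the indexing assumed in this example is a one-line operation; the values can be checked against Exhibit 12:

    /* Index into the four-slot DDS of Exhibit 12 using the low-order two
       bits of the hash value:
       0x1232 -> 2, 0x7731 -> 1, 0x9AB3 -> 3, 0xCAB0 -> 0. */
    unsigned dds_index(unsigned hash_value)
    {
        return hash_value & 0x3;   /* low-order two bits */
    }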

Phase 2: The Current File Redundancy Elimination Phase (CFREP), in which the Current File is examined in light of the data in the DDS built in the RFAP.

Continuing our discussion, the data in Exhibit 12 (essentially, the DDS) is used as follows.

A B-byte-block (256K block) is read into memory. Conceptually, roughly 256K (rolling) hashes are built and stored in a table; that is, one for each overlapping n-byte block in the B-byte-block. The table is then examined, and (since the Current File and the Reference File are defined to be identical for this illustration) we know that at offset 128K in the Current File we shall find a hash code whose value is 0x1232.

Having found a matching hash code it is a simple matter to bring in a block from the Reference File. For convenience we bring in a B-byte-block starting at offset zero. Note that we could bring in any size block that includes the n-byte block at offset 128K in the Reference File. Assume for this illustration that we bring the first B-byte (256K) block into memory.

Since the Reference File and the Current File are identical, we are guaranteed that the two n-byte blocks at offset 128K are identical as well. The simple version of this Method then compares the contents of the two B-byte-blocks, using offset 128K as an anchor from which to determine the length of the run. In this particular case both B-byte-blocks are identical, so all that need be emitted to the Extended Reference File is a notation that the 256K bytes starting at offset 0 in the Reference File are redundant and the redundancy can be eliminated.

The simple Method continues with the next three B-byte-blocks in a similar manner. In this simple-minded example, four small packets of information indicating the redundancy to be removed replace a megabyte of redundancy.

Of course anyone familiar with the art can see some obvious optimizations. Perhaps the most obvious is that the Method, having detected a synchronization point at offset 0 in the Reference File, can Expand The Run by continuing to read forward until synchronization is lost. In our particular example this would mean that the entire Current File would have the redundancy eliminated and thus four packets would be replaced by one.

XXXI. Using Heuristics to Determine Where Runs of Identical Data Can Be Found

Once the realization is made that the DDS can be oversubscribed, the fundamental problem becomes how to use the limited information in the DDS to determine common runs of data. We describe, and will claim, several variants.

Before we do so, we make note of an optimization that is implemented in the Preferred Embodiment. In the Preferred Embodiment the Current File is logically broken up into B-byte-blocks. A B-byte-block (mnemonically, a "big-number-of-bytes block") has a size that is generally, but not necessarily, an even multiple of n-bytes. The number of bytes in a B-byte-block will be claimed to be tunable. In the Preferred Embodiment its default value is 256K.

As is well known to those familiar with the art, reading a large block of data into RAM from a Repository is likely to be considerably faster than reading, sequentially and repetitively, the same total number of bytes as a number of blocks each containing a small number of bytes. In addition, a larger B-byte-block will provide more opportunities for detecting runs of common data between the Reference and Current Files when the subscription ratio is increased.

The number of bytes in a B-byte-block should be selected to take advantage of any characteristics of the user's hardware. For instance, it may be that an optimal size for a B-byte-block is the number of bytes on a track of a disk drive or the number of bytes in a disk drive's on-board cache.

The purpose of a B-byte-block is to bring into RAM a section of the Current File to be rapidly examined without necessitating I/O that is expensive in the time domain.

XXXI (1) Brute Force

The inventors claim this method.

If two digest values match between the Current and Reference Files, find the positions in both files for the n-byte block corresponding to those digest values, then read forwards and backwards to find the boundaries of the run of identical data. Because of the nature of hash codes, the number of bytes of identical data may be zero.

We repeat here what we wrote in Section [0211] VIII: the Inventors assert that it is unsafe to assume that if two digests match then the underlying data matches. Nonetheless, we teach that if the user of the Method wishes to assume that matching digests imply matching underlying data, and assuming that the user's assumption is correct, then this Method will produce correct output and the original data will be recovered.

If the underlying data does match for a certain number of bytes, then, optionally, the Method we teach could expand the match by an arbitrary number of bytes in both directions by using the method we taught in Section [0348] XXI. We also teach that we could use the method of U.S. Pat. No. 5,446,888 (Pyne) to detect insertions and deletions of small numbers of bytes. We also teach that there are many Insertion/Deletion Algorithms known to those familiar with the art for detecting insertions and deletions between two relatively small strings of similar binary data.

For instance, assume that we find that 256 bytes of data in the Reference File and Current File indeed match. Further assume that this 256-byte block of data can be found at position 50000 in the Reference File and position 150000 in the Current File.

The Method would then read forwards and backwards in the Reference and Current Files until there was a mismatch. Assume that mismatches were found at offset 40000 in the Reference File (140000 in the Current File) and at offset 60000 in the Reference File (160000 in the Current File).

The Method could then, optionally, extend the potentially matching data area by, say, 10000 bytes in each direction (that is, to 30000 and 70000 in the Reference File) and then use one of the Insertion/Deletion Algorithms to see if only a small number of bytes has been inserted or deleted near the matching data (40000 through 60000 in the Reference File and 140000 through 160000 in the Current File).
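A minimal sketch of the forwards-and-backwards search follows, assuming for simplicity that both files are held in memory (a real implementation would work through buffered block I/O); the function and parameter names are ours:

    #include <stddef.h>

    /* Expand a match in both directions around anchor offsets ref_pos and
       cur_pos, at which an n-byte block is known (or assumed) to match.
       On return, [*ref_lo, *ref_hi) is the matching run in the Reference
       File; the corresponding Current File run has the same length. */
    void expand_run(const unsigned char *ref, size_t ref_len, size_t ref_pos,
                    const unsigned char *cur, size_t cur_len, size_t cur_pos,
                    size_t n, size_t *ref_lo, size_t *ref_hi)
    {
        size_t back = 0;                     /* bytes gained going backwards */
        while (back < ref_pos && back < cur_pos &&
               ref[ref_pos - back - 1] == cur[cur_pos - back - 1])
            back++;
        size_t fwd = n;                      /* bytes matched going forwards */
        while (ref_pos + fwd < ref_len && cur_pos + fwd < cur_len &&
               ref[ref_pos + fwd] == cur[cur_pos + fwd])
            fwd++;
        *ref_lo = ref_pos - back;
        *ref_hi = ref_pos + fwd;
    }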

XXXI (2) Using Multiple Hash Codes

The inventors claim this method as a dependent claim.

The method of [[0449] XXXI (1) Brute Force] may not be optimal because of the possibility that a particular n-byte block may be repeated several or many times in either the Reference File or the Current File. As those familiar with the art know, there may be many instances of a block of zeros on a user's disk. Similarly, other blocks might be repeated, especially if the user keeps multiple copies of files in the user's Repository.

Because blocks of zeros occur so often, the Preferred Embodiment has special handling for runs of zeros.

The Preferred Embodiment also implements a method that uses the offsets of blocks associated with hash codes to decrease the number of seeks necessary to establish where long runs of common data may be found. As those familiar with the art understand, on modern computers seek operations are computationally expensive and are thus to be avoided.

The method that we describe, and which we will claim, looks at the corresponding relative offsets of n-byte blocks associated with hash codes.

We give an example.

Consider a Reference File whose n-byte blocks begin at the following offsets and have the associated hash codes. Assume that the hash codes represent unique blocks; that is, assume that there are no collisions.

Exhibit 13

    Offset     Hash Code
    150,000    1234
    250,000    1234
    550,000    2345

Consider a Current File whose n-byte blocks begin at the following offsets and have the associated hash codes. Assume that the hash codes represent unique blocks; that is, assume that there are no collisions.

Exhibit 14

    Offset     Hash Code
    450,000    1234
    550,000    1234
    850,000    2345

As should be obvious to anyone familiar with the art, detecting and accounting for long runs is almost always better than processing a collection of short runs, because there are fewer packets of common information to transmit to the Extended Reference File. There is also a reduction in decoration overhead.

Assuming that one processes the above two tables in the order listed (e.g., offset 150,000 is processed before 250,000), one could use the method as indicated in [[0449] XXXI (1) Brute Force]. As is apparent, if one uses the brute force method, either (1) the brute force method will not detect a possibly longer run, or (2) four seek operations would need to be performed to analyze which run to use.

Assume that a naive brute force method was used. In this case the program would seek to position 150,000 in the Reference File and position 450,000 in the Current File. The program would then search in those vicinities for common data and would likely find commonality of less than 100,000 bytes.

On the other hand, the fact that the distance between locations 550,000 and 850,000 in the Current File is the same as the distance between locations 250,000 and 550,000 in the Reference File indicates the possibility of a Run of 300,000 matching bytes between the Reference File and the Current File at the respective locations in those Files. If this is determined to be the case on comparison of the respective contents of the Reference File and the Current File, this commonality of 300,000 bytes could be eliminated and a single packet of information (which is only a few bytes long) representing the common 300,000 bytes could be transmitted to the Extended Reference File.
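In code, the core of this heuristic is nothing more than a comparison of offset deltas; a minimal sketch (names are ours), which for Exhibits 13 and 14 finds 550,000 - 250,000 == 850,000 - 550,000 == 300,000:

    #include <stdbool.h>
    #include <stdint.h>

    /* Two hash matches, at (ref_off1, cur_off1) and (ref_off2, cur_off2),
       suggest one long Run when the relative distances agree. */
    bool suggests_long_run(uint64_t ref_off1, uint64_t cur_off1,
                           uint64_t ref_off2, uint64_t cur_off2)
    {
        return (ref_off2 - ref_off1) == (cur_off2 - cur_off1);
    }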

XXXII. Memory Mapped Disk Files

A feature of, at least, some Microsoft operating systems is to allow the programmer to appear to load an entire huge file into a RAM that is smaller than the file.

While this may reduce the programming effort, it merely defers to the operating system the work of chunking the data.

XXXIII. Method (English)

As we did with the pseudo-code, this explanation ignores edge conditions in order to teach the Method with some clarity. Those familiar with the art understand that the handling of edge conditions is where the major work (called, colloquially, "grunt work") of programming occurs, but it is not illustrative of the method or of the novelty of the method.

XXXIII.(1) The Method Summarized

The Method can be summarized as follows:

Find two n-byte-blocks with identical hash codes. Move forwards and backwards from the point of the match in the Reference and Current Files in order to detect Runs and emit Output Packets.

XXXIII.(2) The Method Optimized in the Face of Various Resource Availability Conditions

The objective of this Method is to minimize the time and computer resources necessary to eliminate the redundancy between two, possibly large, files.

Here is a partial list of constraints circa 2009.

(1a) Disk seeks are on the order of 8 milliseconds. This is the major constraint in modern computers, but we foresee technology in which this constraint is removed. That is, we foresee a technology in which the repository has many of the features of modern RAM but has the additional feature that it is both cheap and maintains its state when power is removed.

(1b) RAM is relatively expensive compared to disk space. A gigabyte of RAM costs approximately $20.

(2) It is computationally expensive to compute a Good Hash. In order to quickly compute a Good Hash, special hardware may be employed.

(1a) and (1b) are variants of "seeks are slow/fast."

There are thus four possibilities, depending on whether (1) seeks are slow or fast, and (2) Good Hashes are cheap or expensive to compute. The Method will vary what needs to be computed and its strategy for randomly reading from the Reference File.

XXXIII.(2)(1) Seeks are Expensive; Good Hashes are Expensive

Circa 2009, this is the common case for the typical computer. Note that while the computation of a Good Hash is expensive, it still takes less time than a random seek.

Under this condition both seeks and Good Hashes are to be avoided, and the Method does this by using Rolling Hashes and other heuristics to avoid both of these expensive operations.

Thus the Method will store both a Rolling Hash and a Good Hash as it builds the DDS for the Reference File.

The Method will then use the Rolling Hash as an "initial match".

The Method will then either (a) continue to compute more Rolling Hashes to see if there are more matches in the right order and at the right distances before committing to computing a Good Hash, or (b) immediately compute a Good Hash (depending on either a heuristic to determine which to use or a User Option).

Once it has been determined that a Good Hash should be computed, the Good Hash is computed and then the DDS is examined for a matching Good Hash value. If a match (or matches) is found, then the B-byte-block(s) corresponding to the matched Good Hash is brought in and the B-byte-block(s) is/are searched backwards and forwards for a Run.

Those familiar with the art fully recognize that Runs may continue backwards and forwards for many B-byte-blocks.

XXXIII.(2)(2) Seeks are Expensive; Good Hashes are Cheap

This is the next most likely scenario circa 2009.

Good Hashes can be made cheap either by a novel Good Hash algorithm being discovered or by Good Hash dedicated hardware being added to the computer. See McLoone, above.

Under this condition seeks are to be avoided but Good Hashes are to be used.

Thus the Method will store only a Good Hash as it builds the DDS for the Reference File.

The Method will then use the Good Hash as an “initial match”.

The Method will then either (a) continue to compute more Good Hashes to see if there are more matches in the right order and at the right distances before committing to a seek, or (b) perform a seek in the Reference File (depending on either a heuristic to determine which to use or a User Option).

Once the Good Hash has been computed, the DDS is searched for a matching Good Hash value. If a match (or matches) is found, then the B-byte-block(s) corresponding to the matched Good Hash is brought in and the B-byte-block(s) is/are searched backwards and forwards for a Run.

Those familiar with the art fully recognize that Runs may continue backwards and forwards for many B-byte-blocks.

XXXIII.(2)(3) Seeks are Cheap; Good Hashes are Expensive

Seeks can be made cheap by existent but as yet (2009) impractically expensive technology, e.g., Solid State Disk (SSD).

Under this condition seeks are not to be avoided.

Thus the Method will store only a Rolling Hash as it builds the DDS for the Reference File.

The Method will then use the Rolling Hash as an “initial match”.

Those familiar with the art fully recognize that Runs may continue backwards and forwards for many B-byte-blocks.

XXXIII.(2)(4) Seeks are Cheap; Good Hashes are Cheap

Seeks can be made cheap by existent but as yet (2009) impractically expensive technology, e.g., Solid State Disk (SSD).

Good Hashes can be made cheap either by a novel Good Hash algorithm being discovered or by Good Hash dedicated hardware being added to the computer. See McLoone, above.

Under this condition neither seeks nor Good Hashes are to be avoided.

Thus the Method will store only a Good Hash as it builds the DDS for the Reference File.

Those familiar with the art fully recognize that Runs may continue backwards and forwards for many B-byte-blocks.

XXXIII.(3) Hash Codes in the Right Order and at the Right Distances

In the discussion in section [0476] XXXIII.(2), above, we referred to "continue to compute more (Rolling) Hashes to see if there are more matches in the DDS in the right order and at the right distances before committing to a seek". This deserves further explication.

Since, using current 2009 technology, both seeks (one can only do about 100 random seeks per second on the typical personal computer) and Good Hashes are expensive, the Method attempts to minimize these costs in the time domain by using heuristics to minimize the use of seeks and the computation of Good Hashes.

One technique used by the Method is to note that matches on the Rolling Hashes can appear in the right order and at the right distances.

For instance, let us say that a Rolling Hash match can be found at position 0 in the Reference File (this positional information is extracted from the DDS) and position 1000 in the Current File.

The Method could continue computing Rolling matches and then detect that there is a Rolling Hash match at positions 512 and 1512 in the Reference File and the Current File, respectively (this positional information is, again, extracted from the DDS).

A Rolling Hash match may not be found at 1256 (corresponding to 256 in the Reference File), even though there is a Run that includes that position, because the DDS may be oversubscribed at that position.

Having detected that there are now two Rolling Hashes that match in the proper order and at the proper distance, the Method may then choose to compute one or more corresponding Good Hashes. It could then use these Good Hashes to examine the DDS and determine if a seek should be performed.

Note that the distances need not be exact, because the Method may use binary differences to emit decorated packets. Thus the Method may use a heuristic that says that the matches may be at N-byte offsets plus some arbitrary, heuristically set value.

Thus, for example, the Method might decide that a second Rolling Hash match at positions 514 and 1514 (as opposed to 512 and 1512) is a match "in the right order and in the right positions".
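A sketch of such a tolerance test follows; the names, and the notion of a single "slack" parameter for the heuristically set value, are our illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    /* Are two Rolling Hash matches "in the right order and at the right
       distances"? (ref1, cur1) and (ref2, cur2) are matched offsets in the
       Reference and Current Files; 'slack' allows for small insertions or
       deletions between the two matches. */
    bool right_order_and_distance(int64_t ref1, int64_t cur1,
                                  int64_t ref2, int64_t cur2, int64_t slack)
    {
        if (ref2 <= ref1 || cur2 <= cur1)
            return false;                         /* not in the right order */
        int64_t skew = (ref2 - ref1) - (cur2 - cur1);
        if (skew < 0)
            skew = -skew;                         /* absolute difference */
        return skew <= slack;                     /* within the allowed slack */
    }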

An obvious extension to this Method is to use three or more Rolling Hash matches "in the right order and in the right positions" to determine if a Good Hash is to be computed.

In fact, if we have enough Rolling Hash matches at corresponding places in the data, we could skip the Good Hash generation and just read the data, as the probability of several Rolling Hash codes matching at the right places generating a False Positive might be small. With current technology this is unlikely to give much improvement, because calculating a few Good Hashes takes much less time than one disk seek, but that may not be true in the future.

Those familiar with the art understand that where Good Hashes are cheap we could substitute Good Hashes for Rolling Hashes in the discussion above.

XXXIII.(4) Brief Overview

In the Preferred Embodiment, where seeks are expensive and the computation of Good Hashes is expensive (see [0484] [0476] XXXIII.(2)(1), above), the Reference File is logically broken up into NN n-byte-blocks. A subset of the NN n-byte-blocks (whose count is DDSMHC) is selected to fill a hash table. The hash table contains an array of the following data structure:

Exhibit 15

    unsigned LowerHashValue;
    unsigned BlockNumber;
    longlong GoodHashValue;

A B-byte-block of the Current File is read starting at byte offset 0 in the Current File. Whereas an n-byte-block is typically 256 bytes, a B-byte-block may be 256K bytes. A B-byte-block is supposed to contain many n-byte-blocks although, theoretically, a B-byte-block could be smaller or larger than an n-byte block. Since computer files are not, in general, an exact multiple of BN, there is a small chance that the last B-byte-block will be smaller than an n-byte-block. In our example, the probability that a B-byte-block will be smaller than an n-byte-block is 1:1000.

We define here the Match Finding Result Map (MFRM): a map containing one entry for each of the Rolling Hash matches that were found for a specific B-byte-block, together with their respective ERF block numbers and Good Hashes. The Inventors have coined this phrase.

The first n bytes (e.g. 256 bytes, bytes 0 through 255) of the B-byte-block have a (rolling) hash code computed, and this value is looked up in the DDS. If it is found in the DDS, then that fact and the associated Rolling Hash information is stored in the MFRM. Then the next Rolling Hash code, for bytes 1 through 256 of the B-byte-block, is computed and looked up in the DDS, and the appropriate data is stored in the MFRM. This continues until all of the Rolling Hashes in the B-byte-block have been processed.

Instead of calculating the Rolling Hashes "on the fly", the Method could, equivalently, build a table of Rolling Hashes first, then search the DDS for each Rolling Hash in turn.
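For illustration, the scan over one B-byte-block might be sketched as follows. The types Dds, Ded, and Mfrm and all helper functions are assumptions of the sketch (one possible implementation of the rolling-hash helpers is sketched in Section XXXIX, below):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Dds Dds;
    typedef struct Ded Ded;
    typedef struct Mfrm Mfrm;
    uint32_t rolling_hash_init(const uint8_t *bytes, size_t n);
    uint32_t rolling_hash_top(size_t n);
    uint32_t rolling_hash_slide(uint32_t h, uint32_t top, uint8_t out, uint8_t in);
    const Ded *dds_lookup_rolling(const Dds *dds, uint32_t rolling_hash);
    void mfrm_add(Mfrm *mfrm, size_t offset_in_block, const Ded *ded);

    /* Compute the Rolling Hash of every overlapping n-byte window in one
       B-byte-block, look each value up in the DDS, and record hits in the
       MFRM. */
    void scan_b_block(const uint8_t *b, size_t b_len, size_t n,
                      const Dds *dds, Mfrm *mfrm)
    {
        if (b_len < n)
            return;                               /* block smaller than one window */
        uint32_t top = rolling_hash_top(n);
        uint32_t h = rolling_hash_init(b, n);     /* hash of bytes [0, n) */
        for (size_t i = 0; ; i++) {
            const Ded *ded = dds_lookup_rolling(dds, h);
            if (ded != NULL)
                mfrm_add(mfrm, i, ded);           /* remember the match and its DED */
            if (i + n >= b_len)
                break;                            /* window reached the end */
            h = rolling_hash_slide(h, top, b[i], b[i + n]); /* slide one byte */
        }
    }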

XXXIII.(5) Extended Reference File DED Storage Overwrite Strategy (ERFDSOS)

As we stated in the definition of the Extended Reference File DED Storage Overwrite Strategy (ERFDSOS): because the DDS is oversubscribed, as new Output Packets are generated and new n-byte blocks of data are logically appended to the Extended Reference File, a strategy needs to be used to determine when to overwrite older DEDs. As those familiar with the art will readily see, there are many strategies that can be used to determine when to replace older elements in the DDS with newer DEDs representing newly created Output Packets.

There are many possible techniques for selecting which DEDs are to be inserted into the DDS. Perhaps the simplest is to select every SRth n-byte-block, compute a DED for it, and insert the DED into the DDS.

Since the RAM memory allocated to DEDs is assumed to be a relatively precious resource, we do not wish to waste entries by not filling them. If the DDS is a hash table then the algorithm described in the last paragraph may not be optimal, for multiple reasons.

One reason is that if multiple DEDs all hash to the same value then a hash table entry or entries will remain vacant and useful information will be lost. Let us give a possibly unrealistic example.

If the number of n-byte-blocks is 200 and the number of hash table entries is 10, then this generates an SR value of 20. If the tenth and twentieth n-byte-blocks hash to, say, the third entry in the hash table, then the hash table will have at least one slot that is empty.

As anyone who is familiar with the art understands, there are numerous trivial ways to fix this. Perhaps the most trivial is to check the entry in the hash table and, if it is occupied, ignore this n-byte-block and use the next n-byte-block to compute a hash and fill an entry in the table.
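A sketch of that trivial fix follows, assuming for simplicity that the whole Reference File is in memory; the helper names are hypothetical and edge conditions are glossed over:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct Dds Dds;
    uint64_t good_hash(const uint8_t *data, size_t n);
    bool dds_slot_occupied(const Dds *dds, uint64_t hash);
    void dds_insert(Dds *dds, uint64_t hash, size_t block_number);

    /* Populate the DDS with (roughly) every SRth n-byte block of the
       Reference File; if a block's slot is already occupied, fall through
       to the next n-byte block rather than waste the entry. */
    void populate_dds(Dds *dds, const uint8_t *ref, size_t nn /* block count */,
                      size_t n, size_t sr)
    {
        for (size_t start = 0; start < nn; start += sr) {
            for (size_t b = start; b < nn; b++) {
                uint64_t h = good_hash(ref + b * n, n);
                if (!dds_slot_occupied(dds, h)) {
                    dds_insert(dds, h, b);   /* record hash and block number */
                    break;                   /* done with this SR window */
                }
            }
        }
    }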

This algorithm does, though, show that there is an interesting complication. Consider a Reference File that is composed of nothing more than two identical sub-files, each sub-file located at an n-byte boundary. Since a sub-objective of the Method is to detect commonality between a Reference File and a Current File, the question is: which hash code, and associated Reference File or Current File offset, associated with an n-byte-block (the first or a later one) is to be entered into the DDS (e.g. the hash table)?

In our particular example, it doesn't matter, because the two sub-files are identical. But if they are slightly different then it matters, because this Method reduces redundancy by substituting a pointer-to-an-offset-and-a-length in the Reference File for the actual contents of the stream of bytes in the Current File.

This problem of collisions in the hash table can be eliminated, of course, by using something other than a hash table to store DEDs. By storing the hash codes in, say, a red-black tree, multiple DEDs with the same hash code can be inserted and searched for in O(log n) time.

Using a red-black tree eliminates many of the complications of multiple identical hash values being inserted but, clearly, any Database Mapping Engine could be used to (1) detect if a hash value exists (the hash value is used as a key), or (2) fetch the data associated with the particular hash value.

As those familiar with the art of Programming know, generally there is no a priori optimal strategy for performing any operation. Generally, if there are several possible strategies, the implementers of a method will pick a default one and allow for one or more User Options to select which strategy to use.

For instance, in a word processing program, a user option might be how often, measured in minutes, to save a copy of the document being worked on. There is no a priori optimal answer, and the user sets a User Option depending on the user's preferences.

Thus the user might have a preference to have no DEDs substituted in the DDS as Output Packets are added to the Extended Reference File (ERF). By selecting this option, the user may be indicating that she has a collection of Current Files to be added as Chapters that are variants of a base Reference File.

Let us explain further what we mean by the previous paragraph.

Consider a company that has purchased, say, ten identical computers with identical disk drives containing identical data on January 1, before any user even turns on any one of these ten computers. Clearly only a single machine needs to have a backup taken, since on January 1 all the backups would be identical because all the data on all ten disk drives would be identical.

On January 2nd, ten users have received emails, installed some games, downloaded some pictures of each person's family, etc. In other words, the users have personalized their particular computers.

Now consider the optimal strategy for creating ten new Chapters representing each of these ten machines.

Under the conditions specified above, it would not make sense to add DEDs to the DDS from new Output Packets, because we wish to bias the DDS to have the DEDs from the base Reference File. The reason for this is that new Output Packets will have data from different users, and it is extremely unlikely that the pictures of one user's family would be identical to pictures from another user's family.

By replacing the DEDs in this situation, we would have fewer DEDs pointing to identical data and thus deduplication efficiency would be reduced.

On the other hand, if instead of ten Current Files representing different users we have ten Current Files representing Image Backups of ten consecutive days of the same user, then we would want to bias the DDS to represent newer data, so that data from more recent backups would be more likely to be deduplicated.

XXXIV. Intrafile Deduplication

Nearly anyone familiar with computers is familiar with zip files. Zip compression will take a single file and compress it, generally producing a smaller output file than the input. As those who are familiar with the art are fully aware, there are innumerable compressors known, with various speed and compression characteristics. Well known are ARC, LZW, ARJ, and RAR, as well as many others.

Our Method can also perform intrafile compression for all Chapters by making the initial Reference File a zero-length file (aside from the Chapter Packet) and using an Extended Reference File DED Storage Strategy (ERFDSS) that is biased towards adding DEDs from the Current File.

Indeed, one could modify the Preferred Embodiment such that the DDS is cleared and then populated using the Current File rather than the Reference File.

XXXV. Iterating the Method; Creating Multiple Chapters (English)

An obvious extension of the Method that we shall claim is to use the Extended Reference File (ERF) created by this method as one of the inputs to the next iteration. That is, for example, if on Tuesday there exists the Monday Reference File and the Tuesday Current File, and on Tuesday evening a Tuesday ERF is created, then on Wednesday we can use the Tuesday ERF as Wednesday's Reference File. On Wednesday night, therefore, the Wednesday ERF would contain sufficient information to recreate any or all of Monday's Reference File, Tuesday's Current File, Tuesday's Reference File, Wednesday's Current File, and, of course, Wednesday's ERF.

We define Chapter to mean the collection of data necessary to logically or physically recreate the state of the Repository at a point in time. Thus, for the example in the previous paragraph, there would be three Chapters. The decorated Monday Reference File represents the first Chapter. The Tuesday ERF represents the second Chapter. The Wednesday ERF represents the third Chapter.

This obvious extension has some non-obvious implications for how the Method is to be adjusted to increase the probability that duplicate data is found. Specifically, there are several strategies that might be used to increase the probability that duplicate data is found and thus reduce the size of the ERF.

In the current implementation of the Preferred Embodiment, the implementation appends the data for each Chapter to the end of the previous Extended Reference File. When the next Chapter is to be created, the implementation reads the data added in the previous Chapter and adds the hash codes from that data to the DDS (i.e., in the Preferred Embodiment, the hash table). By doing this, the data from all the Chapters has some probability of being in the DDS.

Clearly, there are a variety of strategies that can be used to store and update the DDS. Those familiar with the art recognize that the DDS may be saved in a separate file. The DDS may continue to be updated using data from newer Chapters, biasing it toward the newer DEDs in the manners described in, for example, Section (3.5).

We teach that it may be better to reduce the probability of later data overwriting earlier data, so that data from all Chapters has a more-or-less equal probability of appearing in the DDS.

We refer the reader to Exhibit 10 for a sample structure of a File with five Chapters. While Exhibit 10 refers to a File, this structure is trivially converted to a byte-oriented file.

XXXVI. Example Cases

XXXVI. (1) Worst Case

As those familiar with the art of compression know, no compression technique is guaranteed to produce compression. Indeed, under worst-case conditions, every compression technique is guaranteed to produce output that is larger than its inputs.

Our Method is no exception, and it is useful to understand this Method's limitations.

The amount of compression that one is likely to get from our Method is, on average, dependent on the size of the DDS; the larger the DDS, the more likely it is that our Method will be able to detect and eliminate common data redundancy.

Consider a DDS with zero entries. The Method will proceed, roughly, as follows:

A pointless "Reference File Analysis Phase" would be done to build a non-existent DDS. As usual, the Reference File would be physically and/or logically copied to the Extended Reference File while digests are inserted into the DDS. Since the DDS has, by our example, no entries, the computation of digests is also pointless.

Once the Reference File Analysis Phase is complete, the Method proceeds to the Current File Redundancy Elimination Phase (CFREP), in which the Method searches (via the non-existent DDS) the Reference File for duplicate data.

In the degenerate case of the Preferred Embodiment, a B-byte-block is brought in from the beginning (byte position zero) of the Current File. Digests are built for every n-byte block in the B-byte-block and the non-existent DDS is searched for matching hash codes. By definition, no hash codes will match, and thus a decorated B-byte-block, whose length is BN bytes plus the size of the decoration, will be transmitted to the Extended Reference File.

The computations of the paragraph immediately above will be repeated for each B-byte-block in the Current File.

If the user did not use our Method but merely kept the Reference and Current Files, the total number of bytes consumed would, obviously, be the sum of the file lengths of the Reference and Current Files.

In this, our worst-case scenario, the number of bytes used will be the sum of the file lengths of the Reference and Current Files, plus the number of bytes for the decoration of the Reference File, plus the number of bytes for decorations of each B-byte-block of the Current File.

Thus the number of bytes by which our Method would exceed the number of bytes used if the user did not use our Method will be the number of bytes for the decoration of the Reference File plus the number of bytes for decorations of each B-byte-block of the Current File.

XXXVI. (2) Other Examples

XXXVI. (2) (A) Identical Files with DDS Containing One Entry and Only Using One Pass Through the Current File

The next example will show a claimed improvement on the Method by using multiple passes through the Current File.

In this example we assume that the DDS contains a single entry and that the entry represents a position in the middle of the identical Reference and Current Files.

In this, a single-pass version of the Method, the process proceeds just as it does in the example explained in [[0565] [0564] XXXVI. (1) Worst Case] up until the middle of the Current File.

As can be seen immediately, half of the Current File (with additional bytes for decorations) will be written to the Extended Reference File. When the B-byte-block containing the middle of the file is brought in and the only hash match is found, the program implementing the Method could run backwards from the middle position and detect that the two files match completely, from the beginning of the two files to the middle position and then onwards to the end of the two files.

Indeed, it is possible for a variant of this Method to "unwrite" the decorated first half and replace it with a decorated pointer and a length.

This may be less than optimal under those circumstances where the Extended Reference File is being written to a WORM-like device. This will waste WORM memory.

In the case where the data is being written to a WORM, writing data that may later be "unwritten" may exceed the device's capacity.

XXXVI. (2) (B) Identical Files with DDS Containing One Entry and Using Multiple Passes Through the Current File

Where compression is more important than time, the user of a program implementing this method might select a variant that has one or more additional passes over the Current File.

In this variant, we modify the DED of Exhibit 15 as shown in Exhibit 16.

Exhibit 16

    unsigned LowerHashValue;
    unsigned BlockNumberInReferenceFile;
    longlong GoodHashValue;
    longlong BlockNumberInCurrentFile;

By adding the field BlockNumberInCurrentFile and filling in this value during a pass over the Current File, the Method can be modified to jump to the position specified in BlockNumberInCurrentFile and work backwards through the file looking for identical runs of data until a non-match is found.

In the case of two identical files and a single entry in the DDS as described above, it should be apparent that this modification would allow the Method to detect that the Reference and Current Files are identical, and would only require the addition of a few bytes to the end of the copied Reference File to indicate that fact.

Thus this variant of the Method would be quite appropriate to minimize the unnecessary writing of identical data.

XXXVII. Implementation as a Stand Alone Appliance

Instead of having the client computer use idle cycles to remove the redundancies as we have illustrated in our Method, it might be more efficacious to have a stand-alone appliance perform the redundancy removal. That is, the client computer (or, equivalently, the Appliance) copies the contents of the client computer's Repository to the Appliance, and then the Appliance performs the redundancy removal as well as additional functions to be described below. This variation of the Method has at least the following advantages.

The first obvious advantage is that the client computer need not devote computer cycles to the task of removing the redundancies. As is apparent to those familiar with the art, copying the contents of the Repository to an Appliance may well take less time than performing the Method on the client computer.

The second non-obvious advantage is that the Appliance could transmit the Reference File as well as the Delta Files to an offsite location, thus guaranteeing that the user can restore the user's data should a local disaster happen to the client computer.

The third non-obvious advantage is that the Appliance could be attached to a multiplicity of Client computers. A user-selectable option could be to store only a single Reference File or a multiplicity of Reference Files, depending on how similar the data among and between the computers is. Of course, an obvious extension to this would be to allow the user of the Appliance to select which Client computers share the same Reference File. Another obvious extension is to allow the Appliance to select which of a multiplicity of Reference Files should become the Reference File for a particular Client computer. These variations on the Method will also be claimed.

The Appliance can also be programmed so that data in older Chapters cannot be modified. That is, the Appliance could become a variation of the Write Once Read Many (WORM) drive. This, too, will be claimed.

A fourth non-obvious advantage is that a separate appliance can have custom hardware to make the redundancy elimination both faster and more secure.

In terms of speed, it would make sense, for instance, to include custom hardware in the appliance to compute hash codes.

In terms of security, it would be trivial to include custom code or custom hardware in the Appliance's firmware that would force the Appliance into a mode in which areas of its repository are semi-permanently or permanently prevented from being rewritten. This, too, will be claimed.

As anyone who understands our Method should by now understand, there is no reason ever to rewrite areas of the Extended Reference File; rather, all changes to the Extended Reference File can be made by appending to the end of the file. This makes it less likely that existing data will be corrupted by subsequent changes, especially if the firmware or hardware can enforce this restriction, as mentioned in the previous paragraph. Thus our Method is ideal for archiving large numbers of versions of a Client Computer's Repository.

It is an obvious extension of the Method to encrypt the data stored in the Appliance. It is an obvious extension of the Method to encrypt the Reference (or a copy of the Reference), Current, and Output Files, whether or not they reside in an Appliance.

XXXVIII. Implementation as a File System

Because the Method may capture the state of a Repository at one or more points in time, it should be obvious to those familiar with the art that an Appliance or, equivalently, a so-called device driver can present the collection of Reference and Patch Files as a File System such as FAT or NTFS.

There are variations on this extension to the Method.

XXXVIII. (1) The Method Presented as File Systems

As those familiar with the art of file system drivers know, one can simulate a physical disk via software and present that simulation to the appropriate level of an operating system, so that a simulated file system becomes indistinguishable from a real physical device to upper levels of the operating system software.

As those familiar with the art understand, a physical disk can contain several file systems. For instance, a physical disk can be partitioned into NTFS and FAT file systems. Because our Method can capture and later recreate an entire physical disk at several points in time, one can immediately see that several file systems can be restored at once. Similarly, operating systems like Windows can perform a so-called "mounting of a file system" operation for one or more partitions (that is, file systems) on a particular physical disk. Because a physical disk can be simulated in software, one can mount the one or more file systems using the simulation. Those familiar with the art know this as "mounting a disk in a file" or "mounting a virtual disk."

If one iterates the Method as we describe in [0556] XXXV, then our method can present each Chapter as a disk-in-a-file from which the operating system can then mount the various file systems. Those familiar with the art are familiar with the innumerable ways that such a collection of Chapters could be presented to a user in order for the user to select one or more Chapters for mounting as virtual disks.

As those who understand the nature of the Method understand, in order for Chapters to be recreated successfully, the data that is in a Chapter must not be modified. If the data in the Reference File or the various Patch Files is accidentally or maliciously modified, then there is a substantial probability that all data in all Chapters will be corrupted.

Thus stored Chapters are read-only. Nonetheless, a Chapter can be mounted as a collection of one or more File Systems in one of two ways: read-only or writeable. If mounted as read-only, then the operating system will reject any attempt to write to the Chapter.

Writeable Chapters can come in one of two variants. The first variant allows all writes, and any new data creates a new Chapter. The second variant allows any new data to be written, but all new data is discarded.

XXXVIII. (2) The Appliance as a File System

If one connects Apple's iPod™ to a Windows™ computer, those familiar with the art will see the iPod presented to the Windows operating system as a file system.

Similarly, we teach that it would be simple for those familiar with the art to use a similar interface to present the Appliance as if the various Chapters were separate File Systems.

As anyone familiar with the art should readily understand, there are a multitude of possible user interfaces that would allow a user to select one or more Chapters.

One simple user interface would be nothing more than an up-and-down button as well as an alphanumeric display allowing the user to select an existing Chapter.

Another possibility is for there to be a command channel, controlled by the user's computer or other device external to the Appliance, that would allow the user to set a wealth of options in the Appliance. One of those options could be which Chapter to present to the user's computer as the mountable virtual disk.

XXXIX. Simplified Description of the Method as Implemented and Illustrated in the Preferred Embodiment

(1) The Reference File is logically broken up into n-byte (e.g. 256-byte) sections. If the Reference File is a megabyte (1,048,576 bytes) then there will be 4096 256-byte sections. That is, in this case, NN would be 4096.

(2) The number of entries that can be stored in the DDS is computed and the number of n-byte blocks is computed. Then the Subscription Ratio (SR) can be computed, which is roughly the number of n-byte sections (NN) divided by the number of slots available in an empty DDS.

(3) DDS population: There are various strategies to populate the DDS. All the strategies will fill (or mostly fill) the DDS, and all have in common the fact that there are more hash keys and values to insert (NN) than there is room in the DDS (DDSMHC). On average, once every SR n-byte sections, the Database Mapping Engine inserts a computationally expensive CRC64 as well as a computationally inexpensive Rolling Hash into the DDS.

In the Preferred Embodiment, when a Chapter is closed, the DDS is stored in a separate file. This file can be read in order to load the DDS for the new Chapter being created.

(3.1) In the Preferred Embodiment, in which there is no specialized Good-Hash-computing hardware, the Database Mapping Engine is implemented as a hash table. The fields of the hash table comprise a Rolling Hash as well as a Good Hash. The values associated with the keys are as described in [0526] XXXIII.(4).

(3.2) As implemented in the Preferred Embodiment, SR is set to 1, so that the CRC64 and the Rolling Hash are computed for each n-byte block. In our example, 4096 entries are inserted into the hash table. If the slot in the hash table is occupied, it is overwritten, thus biasing the hash table to contain later entries. This is just one of many strategies to populate the DDS with a subset of the available hash codes.

(3.3) Another example of a possible strategy to populate the DDS would be to insert every SRth entry into the DDS.

(3.4) Another example of a possible strategy to populate the DDS would be to use a random number generator to select, on average, an insertion once every SRth entry.

(3.5) Another example of a possible strategy to populate the DDS would be to use a random number generator to select, on average, an insertion once every (SR-x)th entry, where x is some "biasing value", so that the DDS is biased towards entries at the beginning, middle, or end, perhaps depending on a User Option.

(3.6) Should the computer in question have specialized hardware to rapidly compute Good Hashes, it then becomes unnecessary for the Method to compute and/or store the Rolling Hash in the DDS, since the Good Hash can be used anywhere in this method that a Rolling Hash can be used. In effect, the Rolling Hash is a computational proxy for a Good Hash, but because of a Rolling Hash's poor statistical properties, the Method will compute a Good Hash when appropriate, as described below.

(4) Having populated the DDS using the Extended Reference File, the Preferred Embodiment then proceeds to the next phase and begins reading in the Current File in chunks of BN bytes. BN need not be fixed, nor do the blocks have to be read in sequential order, although this is the most natural choice.

(4.1) Where the computation of the Good Hash is computationally expensive and/or no specialized hardware to compute a Good Hash is available:

(4.2.1) The Preferred Embodiment will compute a Rolling Hash of length N bytes (e.g. 256 bytes) for all BN-N+1 (e.g. 256K-256+1 == 261,889) overlapping blocks in the B-block. (One possible Rolling Hash is sketched after this list.)

(4.3.2) If the Rolling Hash matches any Rolling Hash in the DDS, then the Preferred Embodiment will select one of several strategies, typically via a User Option, to determine if a (computationally expensive) Good Hash is to be computed.

(4.4.2.1) The simplest strategy is to compute a Good Hash as soon as any Rolling Hash matches.

(4.5.2.2) Because Good Hashes may be very computationally expensive, a clever alternative is to continue computing Rolling Hashes further into the Current File. If two or more Rolling Hash matches are found in the right sequence and at the right distances to correspond to a run of blocks in the Reference File, then a Good Hash is computed for one or more of the corresponding Rolling Hashes.

(4.6.3) Once the Good Hash(es) has/have been computed, an attempt is made to find the Good Hash(es) in the DDS in the following manner. We know that there is a Rolling Hash match. The Good Hash computed for the Current File is compared to the Good Hash in the DED corresponding to the Rolling Hash.

(5) If the Good Hashes match then a block of data is fetched from the Reference File. In the Preferred Embodiment, this block is of size N, but virtually any size block would be usable as long as the block contained the n-byte-block of the corresponding Good Hash.

(6) Having brought in the Reference File block, the Method now searches backwards and forwards for matching byte patterns by performing an Expanding-The-Run operation. An Expanding-The-Run operation may require additional I/O operations in order to compare bytes in the Extended Reference File and the Current File. We teach that the matching patterns do not have to be exact. Those familiar with the art are fully familiar with so-called binary-difference algorithms that can emit difference data for two binary patterns that are similar. See Section [0348] XXI. In the Preferred Embodiment, the program stops at the first mismatching byte in each direction of search.

(7) Having detected an Expanded Run, the information regarding the run (and thus the data duplication elimination information) is emitted to the Extended Reference File as a Matching Data Packet. At a User Option, the Method and the corresponding implementation may adjust previous entries in the Extended Reference File to indicate that previously stored unique data is now part of a Run.

(8) If no Run is found, the decorated unique data is emitted to the Output File as a Literal Data Packet (LDP).
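The specification does not fix a particular Rolling Hash. Purely for illustration, here is a minimal sketch of one classic choice, a polynomial (Rabin-Karp-style) rolling hash; the function names match the hypothetical helpers used in the earlier sketches, and the Preferred Embodiment's actual Rolling Hash may differ:

    #include <stddef.h>
    #include <stdint.h>

    #define ROLL_BASE 257u

    /* Hash of the initial window bytes[0..n-1]. */
    uint32_t rolling_hash_init(const uint8_t *bytes, size_t n)
    {
        uint32_t h = 0;
        for (size_t i = 0; i < n; i++)
            h = h * ROLL_BASE + bytes[i];
        return h;
    }

    /* ROLL_BASE raised to the (n-1)th power, modulo 2**32;
       computed once per window length N. */
    uint32_t rolling_hash_top(size_t n)
    {
        uint32_t t = 1;
        for (size_t i = 1; i < n; i++)
            t *= ROLL_BASE;
        return t;
    }

    /* Slide the window one byte: remove 'out', append 'in'.
       Costs a handful of arithmetic operations per byte, which is what
       makes computing a hash for every overlapping block affordable. */
    uint32_t rolling_hash_slide(uint32_t h, uint32_t top,
                                uint8_t out, uint8_t in)
    {
        return (h - out * top) * ROLL_BASE + in;
    }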

XL. Analyze and Potentially Modify the Extended Reference File to Remove Internal Redundancy

In the Preferred Embodiment, the base Reference File is analyzed to remove internal redundancies. In a previous implementation of the Preferred Embodiment the Reference File was not analyzed to remove the non-local redundancies. As is obvious to anyone familiar with the art, whether or not the base Reference File is scanned for redundancies could easily be made into a User Option.

XLI. Adding the Data from the Current File (i.e., Chapter) to the DDS During an Analysis

In the Preferred Embodiment, the DDS of the Extended Reference File is kept in RAM until it is written out after the analysis of the Current File. In the Preferred Embodiment, the DDS is stored in a separate file. Equivalently, the DDS may be stored as part of the Extended Reference File.

XLII. Using Variable Length N-Byte-Blocks

XLII.1 Overview

In this entire specification we have assumed that n-byte-blocks all have the same size, namely, N bytes. This need not be the case.

As those familiar with the art will readily understand, each block of data for which a DED is created need not be the same length.

The motivation for having variable-length-n-byte-blocks (VLNBBs) might be, for instance, that the user knows that the data at the beginning of a Current File has many small duplicate blocks that would not be detected with a large N-byte value, but that the data at the end of the Current File has long Runs.

Thus the user could set a User Option that would

1. set the size of N-bytes to 512 bytes at the beginning of the Reference File Analysis Phase (RFAP) as the Extended Reference File is processed, and then

2. set the size of N-bytes to 8K when, say, half the Extended Reference File has been processed.

As should be obvious to those familiar with the art, fixed-length n-byte-blocks are a special case of variable-length n-byte-blocks.

XLII.2 Creating Variable Length N-Byte-Blocks in the Reference File Analysis Phase (RFAP)

Having VLNBBs adds a bit of complication. Somehow the length of the block would have to be stored in the DED or the DDS.

For instance, one possible strategy might be to have two DDS's. One DDS would have DEDs that have N-bytes equal to 512, while the other has N-bytes equal to 8K.

In the alternative, one could implement a DDS in which every DED has the associated "n-byte-block length" as a field.

XLII.3 Handling VLNBBs During the Current File Redundancy Elimination Phase (CFREP)

Using variable-length n-byte-blocks also adds complexity to the CFREP.

Conceptually (but not in actuality), in the Preferred Embodiment, at each byte of the Current File a hash code is computed for the n-byte-block starting at said byte.

That is, for instance, if N-bytes is 512, then at byte position 100 in the Current File a hash code for bytes 100 through 611 would be computed, and the hash code so developed would be the key by which the DDS would be searched for a matching hash value.

In the Preferred Embodiment, the size of N-bytes is fixed, and thus no special handling for variable-length blocks is needed.

XLII.4 Creating the Variable-Length-N-Byte-Block-Length-Table (VLNBBLT)

In the variant where N-bytes might be variable length, the Current File Redundancy Elimination Phase (CFREP) would need to know all the possible values of N-bytes.

One way to do this is to keep a table of the unique values of N-bytes as the DDS is being developed. In our example above, such a table would have two entries: 512 and 8K. We call this table the variable-length-n-byte-block-length-table (VLNBBLT).

Another way to create the VLNBBLT is to scan the DDS for all possible unique values of N-bytes.

XLII.5 Using the Variable-Length-N-Byte-Block-Length-Table (VLNBBLT)

For the purposes of this discussion, let us assume that only two unique values of N-bytes are in use.

Thus at byte position 100 in the Current File, two hash code values would be developed:

1. A hash code for the 512-byte block of bytes 100 through 611, and

2. A hash code for the 8K block of bytes 100 through 8291.

Each of these two hash codes would be used as a key to search the DDS. If one or more hash code matches were found, then the Method would proceed identically to the process described many times above (e.g. section [0431] XXX), in which the block of data pointed to in the DED is compared to the block of data in the Current File and, if there is a match, an Expanding-The-Run operation is performed, etc.

XLIII Using High Performance Caches by Using a DED Access Accelerator

As those familiar with the art know, locality of reference can make software run dramatically faster on certain kinds of hardware.

As we have taught in [0234] X, access to disk memory can be a million times slower than access to RAM. Similarly, access to the RAM Memory Hardware Cache (RMHC) (e.g. the L2 cache) can be many times faster than access to RAM. Thus, if data that is often and repeatedly accessed can be kept in the RMHC then the Method can be made to run faster.

In the Preferred Embodiment, if the ERF and the Current File share little duplicate data then roughly one Rolling Hash is computed for each byte of the Current File. For each Rolling Hash computed, the DDS is searched for a DED with a matching hash value. In one implementation of the Preferred Embodiment, this is accomplished by the following steps (a sketch in C++ follows the list):

1. Convert the Rolling Hash value to a hash index by using the bottom 21 bits of the hash value as the hash index (thus producing an index value between 0 and 2,097,151).

2. Use the hash index to point into the hash table of DEDs.

3. Compare the Rolling Hash calculated from the Current File for equality with the Rolling Hash field stored in the DED pointed to in the previous step.
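
A minimal self-contained sketch of these three steps, assuming the 16-byte DED layout described later in the discussion of Diagram 1; the names are illustrative:

    #include <cstdint>

    struct DED {
        uint64_t goodHash;     // 8 bytes
        uint32_t rollingHash;  // 4 bytes
        uint32_t blockNumber;  // 4 bytes
    };

    const uint32_t kHashMask = (1u << 21) - 1;  // bottom 21 bits: 0..2,097,151
    DED dds[1u << 21];                          // the hash table of DEDs (~32 MB)

    bool rollingHashMatches(uint32_t rollingHash)
    {
        uint32_t index = rollingHash & kHashMask;  // step 1: derive the hash index
        const DED& ded = dds[index];               // step 2: point into the table
        return ded.rollingHash == rollingHash;     // step 3: compare hash fields
    }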

Analysis of the implementation of the Preferred Embodiment shows that a large fraction of the execution time of the program is spent in the hash table lookup as described in the previous paragraph.

It is well known to those conversant with the art that hash-coded accesses to memory exhibit very poor locality of reference. That is, it is in the nature of hash codes that, given random data to hash, the hash index derived from the hash code will be, effectively, a random number.

For a Current File that has few matches in the ERF, this results in an effectively random memory access for nearly every byte in the Current File.

Assuming that the Current File is billions of bytes long, then in the implementation of the Preferred Embodiment, experiment shows that this approach causes the performance of the implementation to be limited by the time taken by these billions of random accesses to main memory, which prevents efficient use of processor resources.

We would expect to see a dramatic improvement in performance of the Preferred Embodiment if the entire DDS could be stored in, say, the L2 cache. But the DDS is, generally, many times larger than the L2 cache on modern (2009) workstation computers.

The Inventors have devised a method by which many of these random accesses to RAM can be avoided by, instead, accessing the data in the much faster RAM Memory Hardware Cache (RMHC) by creating and using a data structure which we call a DED Access Accelerator (DAA).

In the Preferred Embodiment, the Inventors create a one-byte Proxy for each DED in the DDS. This array of Proxies is the DAA.

In the Preferred Embodiment, the one-byte Proxy is the low-order byte of the Rolling Hash stored in the DED. It could, of course, be some other collection of bits of the Rolling Hash but a single byte is convenient. A nybble (4 bits) is another obvious size for a Proxy.

Whenever the main hash table (DDS) is loaded into memory or is updated in the course of operation of the Method, the DAA is populated with the one-byte Proxy of the Rolling Hash stored in the corresponding entries in the DDS. If the Proxy in the DAA entry does not match the corresponding Proxy of the Rolling Hash calculated for a particular position in the b-byte block, it is not necessary to access the corresponding DDS entry, as the whole Rolling Hash cannot match if the Proxy for the Rolling Hash does not match.
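
A minimal sketch of keeping the DAA populated and consulting it before touching the DDS, reusing the illustrative DED layout from the sketch above:

    #include <cstdint>

    struct DED { uint64_t goodHash; uint32_t rollingHash; uint32_t blockNumber; };

    const uint32_t kHashMask = (1u << 21) - 1;
    DED     dds[1u << 21];  // ~32 MB: far larger than a typical 2009 L2 cache
    uint8_t daa[1u << 21];  // ~2 MB: small enough to stay largely L2-resident

    // Keep the DAA in sync whenever a DDS entry is written.
    void updateDed(uint32_t index, const DED& ded)
    {
        dds[index] = ded;
        daa[index] = static_cast<uint8_t>(ded.rollingHash);  // one-byte Proxy
    }

    // Cache-resident screen: if the Proxies differ, the full Rolling Hash
    // cannot match, so the random access to the RAM-resident DDS is skipped.
    bool mayMatch(uint32_t rollingHash)
    {
        uint32_t index = rollingHash & kHashMask;
        return daa[index] == static_cast<uint8_t>(rollingHash);
    }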

The DAA and the DDS must be kept in sync. Thus, when the DDS is updated, the DAA is updated as well.

The performance gain comes from the fact that, so long as the DAA is small enough to fit in the RMHC (e.g. the L2 cache used by many modern processors to hold recently-used portions of memory), accessing the DAA is many times as fast as accessing random locations in main memory. We can predict that most of the DAA will remain in the L2 cache because entries in the DAA are being accessed far more frequently than any other data in the inner loop of the program.

As those familiar with the art will realize, and assuming that the DAA cannot explicitly be locked into the RMHC, such an approach will work optimally only if there is (at least) one entry in the DAA for each entry in the DDS; if that relationship does not obtain, then the DDS will have to be accessed for those DDS entries that do not have a DAA entry, thus negating the effectiveness of this Method to avoid such accesses. This limitation arises from the fact that frequent accesses to entries in the DDS will cause the memory locations representing those DDS entries to replace DAA entries in the RMHC.

However, this requirement imposes a limitation on the size of the DDS. For example, if each Proxy in the DAA is one byte of the corresponding DDS Rolling Hash entry, then the number of DDS entries cannot exceed the number of bytes in the RMHC (e.g. L2 cache, typically 1-4 MB in circa 2009 commodity computers). If DAA entries are smaller than one byte, then this limitation can be relaxed correspondingly, but in order to have a reasonable reduction in random accesses to main memory RAM, the subset of the Rolling Hash stored in the DAA must be at least a few bits. With a one-byte DAA entry, on the average only 1 in 256 DDS entries must be accessed (because the one-byte subset of the Rolling Hash in the DAA will have a “false match” to the corresponding Proxy in the calculated Rolling Hash with that frequency, on average). With a 4-bit DAA Proxy, 1 in 16 DDS entries will need to be accessed, and of course with a 1-bit DAA Proxy, 1 in 2 DDS entries will need to be accessed, eliminating most of the performance gain from this method.

The applicability of this way of improving performance thus depends to a large extent on the fact that there are not too many DDS entries for a corresponding DAA (with reasonably sized entries) to fit in the RMHC.

In the Preferred Embodiment, to use this approach we limit the number of entries in the DDS to 2,097,152 and set the DAA entry size to one byte, so that the whole DAA occupies that same number of bytes, most of which will reside in the L2 cache on a computer with at least 2 MB of L2 cache. With a 2048-byte n-byte block, a 256K b-byte block, and an oversubscription ratio of 50-1, this provides good deduplication performance for files of sizes up to approximately 200 GB (2,097,152 entries × 2048 bytes per block × 50 ≈ 215 GB).

It is worth noting that the performance implications of random access to RAM are not trivial; in the Preferred Embodiment, the observed increase in overall throughput is up to 50% for 32-bit code and approximately 150% for 64-bit code. The Inventors have occasionally observed performance throughput increases of over 200% compared to the Method not using the DAA.

However, deduplication approaches that do not use oversubscription cannot use this approach fruitfully because they cannot limit the number of entries in the DDS to nearly the extent that the Method can. For instance, with the same parameters as above but a subscription ratio of 1-1 rather than 50-1, a deduplication implementation would be limited to a 4 GB file rather than a 200 GB file (2,097,152 entries × 2048 bytes ≈ 4.3 GB), which would greatly limit its applicability in commerce. So in the real world, it is not possible to achieve the performance gains that this optimization confers without using oversubscription as claimed in the Inventors' Method.

As those who are familiar with the Art can see, this technique is not limited to data deduplication but applies any time that a database (e.g. table) needs to be searched for a matching value and in which the array of Proxies for the elements in the database can fit into a RAM cache (e.g. L2 cache).

Let us give a contrived example of how this could be used in an environment that is not a data deduplication environment.

Let us say that we have a customer database of 500,000 customers in which the fields comprise

1. Social Security Number

2. Customer Name

3. Customer Name Hash

4. Address

Thus the DED comprises the fields, above.

The DDS is the collection of DEDs and is implemented as a hash table. The hash table is indexed by converting the Customer Name Hash into a Hash Index. Further assume that there are no hash table collisions in the DDS, which can store up to 2 million entries. It is feasible to assume this lack of collisions as an approximation, since there is an undersubscription of 4:1.

From the Customer Name Hash, we extract a one-byte Proxy and store this Proxy into the DAA at the Hash Index position.

And let us say that we also have a list of a billion customer names and wish to determine which of these billion names exist in this database of 500,000 customers.

For each of the billion customer names, a customer name hash is developed. From the customer name hash, a hash index is developed. From the customer name hash, a Proxy is developed.

The DAA, which, presumably, is now in the RAM Memory Hardware Cache (RMHC), e.g., the L2 cache, is indexed via the Hash Index. The Proxy at the Hash Index position in the DAA is fetched and compared to the Proxy developed from the name in the list of a billion names. If the two Proxies do not match then the name does not exist in the DDS of 500,000 customers.

If the Proxies do match, then the implementer could compare Customer Name Hashes or go directly to comparing customer names by a string compare, as the implementer sees fit.
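
This contrived example might be sketched as follows. The names and field types are ours, the FNV-1a name hash is merely illustrative (the Method does not prescribe a particular hash function), and hash collisions are assumed away as stated above:

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct CustomerDED {
        std::string ssn;       // Social Security Number
        std::string name;      // Customer Name
        uint32_t    nameHash;  // Customer Name Hash
        std::string address;   // Address
        bool        occupied;
        CustomerDED() : nameHash(0), occupied(false) {}
    };

    const uint32_t kMask = (1u << 21) - 1;                  // ~2-million-entry DDS
    std::vector<CustomerDED> dds(std::size_t(1) << 21);     // the hash table of DEDs
    std::vector<uint8_t>     daa(std::size_t(1) << 21, 0);  // one-byte Proxies (~2 MB)

    // Illustrative FNV-1a hash of the customer name.
    uint32_t hashName(const std::string& s)
    {
        uint32_t h = 2166136261u;
        for (std::size_t i = 0; i < s.size(); ++i) {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 16777619u;
        }
        return h;
    }

    void addCustomer(const std::string& ssn, const std::string& name,
                     const std::string& address)
    {
        uint32_t h = hashName(name);
        uint32_t index = h & kMask;            // Customer Name Hash -> Hash Index
        CustomerDED& d = dds[index];
        d.ssn = ssn; d.name = name; d.nameHash = h; d.address = address;
        d.occupied = true;
        daa[index] = static_cast<uint8_t>(h);  // store the Proxy in the DAA
    }

    bool isPresent(const std::string& name)
    {
        uint32_t h = hashName(name);
        uint32_t index = h & kMask;
        if (daa[index] != static_cast<uint8_t>(h))  // L2-resident Proxy screen
            return false;                           // definitely not a customer
        const CustomerDED& d = dds[index];          // rare: touch main memory
        return d.occupied && d.nameHash == h && d.name == name;
    }

Each of the billion lookups touches main memory only when the one-byte Proxies happen to match (about 1 time in 256 for absent names), which is what confers the performance gain.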

By keeping the DAA in the RMHC, significant performance gains can be achieved.

Certain RMHC implementations respond to software commands to keep certain data semi-permanently locked in the RMHC. Thus, in one implementation of the Method, one could use such software commands to semi-permanently lock the DAA into an RMHC implemented as an L2 cache.

We now explain Diagram 1.

When a hash table is large, frequent random accesses to it can cause significant performance degradation due to the lack of locality of reference. To mitigate this problem, it is possible to use a DAA. Here is a diagram of how this could work, assuming we have a hash table composed of entries each with the following content, and which are indexed by the Rolling Hash value:

Name          Length    Description
Good Hash     8 bytes   Hash code with good statistical properties
Rolling Hash  4 bytes   Hash code that is easy to compute
Block Number  4 bytes   Block number in ERF to which this entry refers

If we have 2**21 (approximately 2 million) entries in this hash table, and each entry occupies 16 bytes, then the total storage required for the table is approximately 32 Megabytes, which is considerably larger than the L2 cache in most commodity workstation computers circa 2009. Thus, the poor locality of reference of hash accesses will cause many references to main memory with the associated performance degradation.

The DAA mitigates this problem by taking a subset of the index (one byte of the Rolling Hash value, in this case), the Proxy, and copying it to an array that will fit in the L2 cache, here assumed to be at least 2 MB in size.

See Diagram 1. All of the values are in hexadecimal for convenience in showing byte boundaries:

XLIV Improvement in the Storage of Hash Codes into a Hash Table

We now teach a Method for improving the storage of hashes into a hash table. Those familiar with the art will understand that the values in the example, below, are illustrative rather than specific.

For the method to work, the hash table must have a number of entries that is a power of 2; for example, 2**24 (=16,777,216) entries. Clearly, the hash table might have, say, 16,777,219 entries, but then three of the entries would be unneeded by the Method we teach.

By using some arbitrary-but-consistent 24 bits of the hash code as the index into the hash table and then storing the rest of the hash code in the hash table, those 24 bits need not be stored, thus saving nearly half the 64 bits of the hash code.

For instance, if the hash code is 64 bits, and the hash table is 2**24 in size, then we can recover 24 bits of the hash code by simply noting the hash's position in the hash table.

As a practical matter, in order to maintain alignment in RAM, what would likely be done is that 32 bits of the hash code would be stored in one of the 2**24 positions and the remaining 8 bits of hash code would be thrown away.

Obviously, when comparing hash codes, only 56 bits of the hash code would be compared if only 56 bits could be recovered from the hash table.
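
A minimal sketch of this storage scheme, assuming a 64-bit hash code, a 2**24-entry table, and illustrative names:

    #include <cstdint>

    const uint32_t kTableSize = 1u << 24;  // 16,777,216 entries
    uint32_t storedHashBits[kTableSize];   // 32 stored bits per entry

    void store(uint64_t hash)
    {
        uint32_t index = static_cast<uint32_t>(hash & 0xFFFFFF);   // bottom 24 bits
        storedHashBits[index] = static_cast<uint32_t>(hash >> 24); // next 32 bits
        // The top 8 bits (hash >> 56) are thrown away to maintain alignment.
    }

    bool matches(uint64_t hash)
    {
        uint32_t index = static_cast<uint32_t>(hash & 0xFFFFFF);
        // The 24 index bits are recovered from position; together with the
        // 32 stored bits, 56 of the 64 hash bits participate in the compare.
        return storedHashBits[index] == static_cast<uint32_t>(hash >> 24);
    }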

Preferred Embodiment (Best Mode) Source Code and Explanation

XLV. Overview of the Purpose of the Source Code

In this section of the specification, the Inventors will teach several optimizations as implemented in the Preferred Embodiment. These optimizations are commented in the source code listings that can be found on the compact disc submitted to the USPTO and we include this material as an incorporation-by-reference.

The source code listings are code fragments extracted from a working implementation of the Method.

The implementation is standard C++ and compiles and executes successfully using Microsoft Visual Studio 2008.

The extracted fragments themselves, however, will not compile successfully; they are, instead, to be used to understand how the Method works as implemented in the Preferred Embodiment.

XLVI. Definitions Needed to Understand the Source Code

Theoretical Starting Block Number (TSBN): is the block number in the ERF that the beginning of this b-block would match to, if we had a perfect match from this point back to the beginning of the b-block. The TSBN is used in the implementation of the algorithm described in the section entitled “Hash Codes in the right order and at the right distances”. The Inventors have coined this phrase.

Sequential Block Map (SBM): is the data structure used to keep track of the Theoretical Starting Block Numbers (TSBNs) and associated data (“strength” of the entry, starting and ending positions in the b-byte block) for a b-byte-block. The Inventors have coined this phrase.

Match Finding Result Map (MFRM): is a map containing one entry for each of the Rolling Hash matches that were found for a specific b-block, together with their respective ERF block numbers and Good Hashes. The Inventors have coined this phrase.

Access Vector: is an object type that acts like a normal C++ vector, except that it doesn't allocate memory. Its purpose is to allow one to use the [ ] notation to access a contiguous subset of elements of a pre-existing array or vector as though it were a vector in itself. The Inventors have coined this phrase but the concept is well understood by those familiar with the art.
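
A minimal sketch of such a class follows; the actual Access Vector in the Preferred Embodiment's source code may differ in its details:

    #include <cstddef>

    // Non-owning, non-allocating view onto a contiguous subset of an
    // existing array or vector; supports [ ] like a normal vector.
    template <typename T>
    class AccessVector {
    public:
        AccessVector(T* base, std::size_t offset, std::size_t count)
            : data_(base + offset), count_(count) {}

        T&       operator[](std::size_t i)       { return data_[i]; }
        const T& operator[](std::size_t i) const { return data_[i]; }
        std::size_t size() const                 { return count_; }

    private:
        T*          data_;   // points into pre-existing storage; never freed here
        std::size_t count_;
    };

    // Usage: view bytes 100..611 of an existing buffer as a 512-element vector.
    // unsigned char buffer[100000];
    // AccessVector<unsigned char> block(buffer, 100, 512);
    // unsigned char b0 = block[0];  // same element as buffer[100]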

XLVII. Commented Source Code

The flow of the Preferred Embodiment when adding a Current File to the ERF (and ignoring edge cases and some optimizations for simplicity) is as follows:

A. The main program calls a function named FindMatchingSegmentsAndAddThemToArchive.

B. This function reads in each b-byte-block from the Current File, then calls a function named FindAllMatchesInProcessingBlock. This function executes the following algorithm (a condensed sketch follows the list):

1. For each element in the b-byte block that can start an overlapping n-byte block, the function sets the value “i” to the location in the b-byte block where such an n-byte block begins.

2. For each such value of “i”:

a. This function examines the overlapping n-byte blocks in the b-byte block, one at a time, and calculates a Rolling Hash for each such block.

b. It then calculates the index that the Rolling Hash corresponds to in the DAA and the DED.

c. It then checks that entry in the DAA to see if the Proxy for the Rolling Hash it has just calculated matches the Proxy for the corresponding DED entry. If not, then it concludes that it does not have to access the DED, so it skips all the steps up to but not including the Rolling Hash update.

d. If the Proxies do match, it then retrieves the DED entry that has that index.

e. If that entry has been initialized with an ERF block number and Good Hash, it retrieves those values from that entry.

f. It then calculates a Theoretical Starting Block Number (TSBN) for this location in the b-byte-block.

g. Then it looks in the Sequential Block Map (SBM) to see if there is an entry in that Map corresponding to this TSBN.

h. If not, it creates such an entry, containing the TSBN, an initial “strength” of 0, and the current value of “i” as both starting and ending offset values.

i. If there already is such an entry, the code increments the “strength” of this SBM entry and updates its ending location to the current value of “i”.

j. Then the function adds an entry to the Match Finding Result Map, which is an associative array keyed by the value “i”. Each entry in this map contains the ERF n-block number and the Good Hash that was retrieved from the DED for position “i” in the b-byte-block.

k. Finally, the Rolling Hash is updated to the Rolling Hash value corresponding to the next overlapping n-byte block (that is, the n-byte block starting with the next value of “i”).

3. Use the SBM to remove unnecessary and/or inefficient short Runs.

4. Return the MFRM to the calling function.
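
The following condensed sketch shows the shape of this algorithm. The names and types are ours; the additive hash stands in for the Method's Rolling Hash; the TSBN formula is one plausible formulation rather than the Preferred Embodiment's; and proxyMatches and dedLookup are assumed to be provided as sketched in section XLIII. The actual code is in Exhibit 18:

    #include <cstddef>
    #include <cstdint>
    #include <map>

    struct SbmEntry  { int strength; std::size_t start, end; };
    struct MfrmEntry { uint32_t erfBlockNumber; uint64_t goodHash; };

    typedef std::map<long, SbmEntry>         SequentialBlockMap;
    typedef std::map<std::size_t, MfrmEntry> MatchFindingResultMap;

    // Assumed helpers (see the DAA sketches in section XLIII):
    bool proxyMatches(uint32_t rollingHash);
    bool dedLookup(uint32_t rollingHash, uint32_t* erfBlock, uint64_t* goodHash);

    // Illustrative additive rolling hash: weak, but cheap to slide by one byte.
    uint32_t initialHash(const unsigned char* p, std::size_t n)
    { uint32_t h = 0; for (std::size_t j = 0; j < n; ++j) h += p[j]; return h; }
    uint32_t slideHash(uint32_t h, unsigned char out, unsigned char in)
    { return h - out + in; }

    void findAllMatchesInProcessingBlock(const unsigned char* bBlock,
                                         std::size_t bLen, std::size_t nBytes,
                                         SequentialBlockMap& sbm,
                                         MatchFindingResultMap& mfrm)
    {
        uint32_t rh = initialHash(bBlock, nBytes);
        for (std::size_t i = 0; i + nBytes <= bLen; ++i) {            // steps 1-2
            if (proxyMatches(rh)) {                                   // steps c-d
                uint32_t erfBlock; uint64_t goodHash;
                if (dedLookup(rh, &erfBlock, &goodHash)) {            // step e
                    long tsbn = (long)erfBlock - (long)(i / nBytes);  // step f
                    SequentialBlockMap::iterator it = sbm.find(tsbn); // step g
                    if (it == sbm.end()) {                            // step h
                        SbmEntry fresh = { 0, i, i };
                        sbm[tsbn] = fresh;
                    } else {                                          // step i
                        ++it->second.strength;
                        it->second.end = i;
                    }
                    MfrmEntry found = { erfBlock, goodHash };         // step j
                    mfrm[i] = found;
                }
            }
            if (i + nBytes < bLen)                                    // step k
                rh = slideHash(rh, bBlock[i], bBlock[i + nBytes]);
        }
        // Step 3 (pruning short Runs via the SBM) is omitted here; the MFRM
        // is returned via the reference parameter (step 4).
    }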

C. The function FindMatchingSegmentsAndAddThemToArchive continues by calling a function called ProcessPossibleMatches. This function executes the following algorithm (a condensed sketch follows the list):

1. For each element in the b-byte block that can start an overlapping n-byte block, the function sets the value “i” to the location in the b-byte block where such an n-byte block begins.

2. For each such value of “i”:

a. See if we have a Run of zeroes. If so, process it and continue.

b. Otherwise, look in the MFRM to see if we have an entry for this value of “i”.

c. If so, then calculate a Good Hash for the n-byte block at position “i” in the b-byte block.

d. If this matches the Good Hash found in the MFRM entry, then call a function called MatchOnlyWhatWeNeedTo. This function reads data from the ERF and Extends the Run as far as possible in both directions in the ERF and returns the starting and ending positions of that Run, or an indication of a “false match” if no Run is found.

e. If the return value from MatchOnlyWhatWeNeedTo indicates that we have found a Run, we call ProcessUsableMatches to emit a Matching Packet corresponding to that Extended Run.

f. Finally, after control returns from ProcessUsableMatches, ProcessPossibleMatches adjusts the value of “i” to skip over the segment of the b-byte block that has been accounted for by ProcessUsableMatches.
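
A similarly condensed sketch of this algorithm, with illustrative names and assumed helper declarations standing in for the Preferred Embodiment's functions; the actual code is in the ProcessPossibleMatches.txt listing:

    #include <cstddef>
    #include <cstdint>
    #include <map>

    struct MfrmEntry { uint32_t erfBlockNumber; uint64_t goodHash; };
    typedef std::map<std::size_t, MfrmEntry> MatchFindingResultMap;

    // Assumed helpers:
    bool     isZeroRun(const unsigned char* p, std::size_t n);
    void     processZeroRun(std::size_t* i);
    uint64_t computeGoodHash(const unsigned char* p, std::size_t n);
    bool     matchOnlyWhatWeNeedTo(std::size_t i, uint32_t erfBlock,
                                   std::size_t* runStart, std::size_t* runEnd);
    void     processUsableMatches(std::size_t runStart, std::size_t runEnd);

    void processPossibleMatches(const unsigned char* bBlock, std::size_t bLen,
                                std::size_t nBytes,
                                const MatchFindingResultMap& mfrm)
    {
        for (std::size_t i = 0; i + nBytes <= bLen; /* advanced below */) {
            if (isZeroRun(bBlock + i, nBytes)) {                      // step a
                processZeroRun(&i);
                continue;
            }
            MatchFindingResultMap::const_iterator it = mfrm.find(i);  // step b
            if (it != mfrm.end()                                      // step c
                && computeGoodHash(bBlock + i, nBytes) == it->second.goodHash) {
                std::size_t runStart, runEnd;                         // step d
                if (matchOnlyWhatWeNeedTo(i, it->second.erfBlockNumber,
                                          &runStart, &runEnd)) {
                    processUsableMatches(runStart, runEnd);           // step e
                    i = runEnd + 1;                                   // step f
                    continue;
                }
            }
            ++i;
        }
    }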

XLVII.1 FindMatchingSegmentsAndAddThemToArchive( )

Referenced, below, we find the relevant section of the function that finds matching Runs of data and adds them to the ERF (e.g. the “Archive”).

This listing can be found on the compact disc submitted to the USPTO and we include this material as an incorporation-by-reference:

Exhibit 17 File: FindMatchingSegmentsAndAddThemToArchive.txt Creation date: Jun. 8, 2009 Size: 2017

XLVII.2 FindAllMatchesInProcessingBlock( )

Referenced, below, is the relevant section of the function that finds Rolling Hash matches in the Current File b-byte buffer:

This listing can be found on the compact disc submitted to the USPTO and we include this material as an incorporation-by-reference:

Exhibit 18 File: FindAllMatchesInProcessingBlock.txt Creation date: Jun. 8, 2009 Size: 4833

XLVII.3 ProcessPossibleMatches( )

Referenced, below, is the code for the function that determines which Rolling Hash matches should be looked up in the ERF. Note that the return value from this function is not currently used.

This listing can be found on the compact disc submitted to the USPTO and we include this material as an incorporation-by-reference:

File: ProcessPossibleMatches.txt Creation date: Jun. 8, 2009 Size: 4679

XLVII.4 MatchOnlyWhatWeNeedTo( )

Referenced, below, is the code for the function that generates the largest match possible:

This listing can be found on the compact disc submitted to the USPTO and we include this material as an incorporation-by-reference:

Exhibit 19 File: MatchOnlyWhatWeNeedTo.txt Creation date: Jun. 8, 2009 Size: 3780

Although this invention has been described with respect to specific embodiments, it is not intended to be limited thereto, and various modifications which will become apparent to the person of ordinary skill in the art are intended to fall within the spirit and scope of the invention as described herein, taken in conjunction with the accompanying drawing and the appended claims.

What is claimed is:
1. A method of reducing redundancy between two or more data sets comprising: generating a plurality of first hash codes for a plurality of data blocks associated with one or more reference files, wherein the first hash codes are generated with a first hash algorithm executing in one or more computer processors; storing one or more of the plurality of first hash codes in one or more hash entries in a hash table; using the first hash algorithm to compute at least a first hash code for a current data block associated with a current file; comparing the first hash code associated with the current data block with the one or more first hash codes stored in the one or more hash entries in the hash table; when the first hash code of the current data block matches at least one of the first hash codes in the one or more hash entries in the hash table, generating a second hash code for the current data block, wherein the second hash code is generated with a second hash algorithm that is computationally more expensive than the first hash algorithm; and when the second hash code for the current data block matches a second hash code associated with the first hash entry, comparing the data in at least one of preceding and succeeding data blocks of the current and reference files to identify a matching run of data in the current and reference files.